public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH v8 0/8] Add vhost-user support to passt. (part 3)
@ 2024-10-10 12:28 Laurent Vivier
  2024-10-10 12:28 ` [PATCH v8 1/8] packet: replace struct desc by struct iovec Laurent Vivier
                   ` (7 more replies)
  0 siblings, 8 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:28 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

This series of patches adds vhost-user support to passt
and then allows passt to connect to QEMU network backend using
virtqueue rather than a socket.

With QEMU, rather than using to connect:

  -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket

we will use:

  -chardev socket,id=chr0,path=/tmp/passt_1.socket
  -netdev vhost-user,id=netdev0,chardev=chr0
  -device virtio-net,netdev=netdev0
  -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE
  -numa node,memdev=memfd0

The memory backend is needed to share data between passt and QEMU.

Performance comparison between "-netdev stream" and "-netdev vhost-user":

$ iperf3 -c localhost -p 10001  -t 60 -6 -u -b 50G

socket:
[  5]   0.00-60.05  sec  95.6 GBytes  13.7 Gbits/sec  0.017 ms  6998988/10132413 (69%)  receiver
vhost-user:
[  5]   0.00-60.04  sec   237 GBytes  33.9 Gbits/sec  0.006 ms  53673/7813770 (0.69%)  receiver

$ iperf3 -c localhost -p 10001  -t 60 -4 -u -b 50G

socket:
[  5]   0.00-60.05  sec  98.9 GBytes  14.1 Gbits/sec  0.018 ms  6260735/9501832 (66%)  receiver
vhost-user:
[  5]   0.00-60.05  sec   235 GBytes  33.7 Gbits/sec  0.008 ms  37581/7752699 (0.48%)  receiver

$ iperf3 -c localhost -p 10001  -t 60 -6

socket:
[  5]   0.00-60.00  sec  17.3 GBytes  2.48 Gbits/sec    0             sender
[  5]   0.00-60.06  sec  17.3 GBytes  2.48 Gbits/sec                  receiver
vhost-user:
[  5]   0.00-60.00  sec   191 GBytes  27.4 Gbits/sec    0             sender
[  5]   0.00-60.05  sec   191 GBytes  27.3 Gbits/sec                  receiver

$ iperf3 -c localhost -p 10001  -t 60 -4

socket:
[  5]   0.00-60.00  sec  15.6 GBytes  2.24 Gbits/sec    0             sender
[  5]   0.00-60.06  sec  15.6 GBytes  2.24 Gbits/sec                  receiver
vhost-user:
[  5]   0.00-60.00  sec   189 GBytes  27.1 Gbits/sec    0             sender
[  5]   0.00-60.04  sec   189 GBytes  27.0 Gbits/sec                  receiver

v8:
  - remove iov_size() from vu_collect_one_frame()
  - move vu_packet_check_range() to vu_common.c
  - fix UDP when dlen is 0.

v7:
  - rebase
  - use vu_collect_one_frame() to do vu_collect() (collect multiple frame)
  - add vhost-user tests from Stefano

v6:
  - rebase
  - extract 3 patches from "vhost-user: add vhost-user":
      passt: rename tap_sock_init() to tap_backend_init()
      tcp: Export headers functions
      udp: Prepare udp.c to be shared with vhost-user
  - introduce new functions vu_collect_one_frame(),
    vu_collect(), vu_set_vnethdr(), vu_flush(), vu_send_single()
    to be called from tcp_vu.c, udp_vu.c and ICMP/DHCP where vhost-user
    code was duplicated.

v5:
  - rebase on top of 2024_09_06.6b38f07
  - rework udp_vu.c as ref.udp.v6 has been removed and we need to
    know if we receive IPv4 or IPv6 frame when we prepare the
    guest buffers for recvmsg()
  - remove vnet->hdrlen as the size is always the same with virtio-net v1
  - address comments from David and Stefano

v4:
  - rebase on top of 2024_08_21.1d6142f
    (rebasing on top of 620e19a1b48a ("udp: Merge udp[46]_mh_recv arrays")
     introduces a regression in the measure of the latency with UDP
     because I think I don't replace correctly ref.udp.v6 that is removed
     by this commit)
  - Addressed most of the comments from David and Stefano
    (I didn't want to postpone this version to next week,
     so I'll address the remaining comments in the next version).

v3:
  - rebase on top of flow table
  - update tcp_vu.c to look like udp_vu.c (recv()/prepare()/send_frame())
  - address comments from Stefano and David on version 2

v2:
  - remove PATCH 4
  - rewrite PATCH 2 and 3 to follow passt coding style
  - move some code from PATCH 3 to PATCH 4 (previously PATCH 5)
  - partially addressed David's comment on PATCH 5

Laurent Vivier (7):
  packet: replace struct desc by struct iovec
  vhost-user: introduce virtio API
  vhost-user: introduce vhost-user API
  udp: Prepare udp.c to be shared with vhost-user
  tcp: Export headers functions
  passt: rename tap_sock_init() to tap_backend_init()
  vhost-user: add vhost-user

Stefano Brivio (1):
  test: Add tests for passt in vhost-user mode

 Makefile               |   9 +-
 conf.c                 |  21 +-
 epoll_type.h           |   4 +
 iov.c                  |   1 -
 isolation.c            |  15 +-
 packet.c               |  91 ++--
 packet.h               |  22 +-
 passt.1                |  10 +-
 passt.c                |  11 +-
 passt.h                |   6 +
 pcap.c                 |   1 -
 tap.c                  | 128 ++++--
 tap.h                  |   7 +-
 tcp.c                  |  37 +-
 tcp_internal.h         |  15 +
 tcp_vu.c               | 476 ++++++++++++++++++++
 tcp_vu.h               |  12 +
 test/lib/perf_report   |  15 +
 test/lib/setup         |  77 +++-
 test/lib/setup_ugly    |   2 +-
 test/passt_vu          |   1 +
 test/passt_vu_in_ns    |   1 +
 test/perf/passt_vu_tcp | 211 +++++++++
 test/perf/passt_vu_udp | 159 +++++++
 test/run               |  25 ++
 test/two_guests_vu     |   1 +
 udp.c                  |  84 ++--
 udp_internal.h         |  34 ++
 udp_vu.c               | 336 ++++++++++++++
 udp_vu.h               |  13 +
 util.h                 |   8 +
 vhost_user.c           | 977 +++++++++++++++++++++++++++++++++++++++++
 vhost_user.h           | 206 +++++++++
 virtio.c               | 660 ++++++++++++++++++++++++++++
 virtio.h               | 184 ++++++++
 vu_common.c            | 385 ++++++++++++++++
 vu_common.h            |  47 ++
 37 files changed, 4138 insertions(+), 154 deletions(-)
 create mode 100644 tcp_vu.c
 create mode 100644 tcp_vu.h
 create mode 120000 test/passt_vu
 create mode 120000 test/passt_vu_in_ns
 create mode 100644 test/perf/passt_vu_tcp
 create mode 100644 test/perf/passt_vu_udp
 create mode 120000 test/two_guests_vu
 create mode 100644 udp_internal.h
 create mode 100644 udp_vu.c
 create mode 100644 udp_vu.h
 create mode 100644 vhost_user.c
 create mode 100644 vhost_user.h
 create mode 100644 virtio.c
 create mode 100644 virtio.h
 create mode 100644 vu_common.c
 create mode 100644 vu_common.h

-- 
2.46.2



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v8 1/8] packet: replace struct desc by struct iovec
  2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
@ 2024-10-10 12:28 ` Laurent Vivier
  2024-10-10 12:28 ` [PATCH v8 2/8] vhost-user: introduce virtio API Laurent Vivier
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:28 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

To be able to manage buffers inside a shared memory provided
by a VM via a vhost-user interface, we cannot rely on the fact
that buffers are located in a pre-defined memory area and use
a base address and a 32bit offset to address them.

We need a 64bit address, so replace struct desc by struct iovec
and update range checking.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 packet.c | 80 ++++++++++++++++++++++++++++++--------------------------
 packet.h | 14 ++--------
 2 files changed, 45 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index ccfc84607709..37489961a37e 100644
--- a/packet.c
+++ b/packet.c
@@ -22,6 +22,35 @@
 #include "util.h"
 #include "log.h"
 
+/**
+ * packet_check_range() - Check if a packet memory range is valid
+ * @p:		Packet pool
+ * @offset:	Offset of data range in packet descriptor
+ * @len:	Length of desired data range
+ * @start:	Start of the packet descriptor
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
+ *
+ * Return: 0 if the range is valid, -1 otherwise
+ */
+static int packet_check_range(const struct pool *p, size_t offset, size_t len,
+			      const char *start, const char *func, int line)
+{
+	if (start < p->buf) {
+		trace("packet start %p before buffer start %p, "
+		      "%s:%i", (void *)start, (void *)p->buf, func, line);
+		return -1;
+	}
+
+	if (start + len + offset > p->buf + p->buf_size) {
+		trace("packet offset plus length %lu from size %lu, "
+		      "%s:%i", start - p->buf + len + offset,
+		      p->buf_size, func, line);
+		return -1;
+	}
+
+	return 0;
+}
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -41,34 +70,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 		return;
 	}
 
-	if (start < p->buf) {
-		trace("add packet start %p before buffer start %p, %s:%i",
-		      (void *)start, (void *)p->buf, func, line);
+	if (packet_check_range(p, 0, len, start, func, line))
 		return;
-	}
-
-	if (start + len > p->buf + p->buf_size) {
-		trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
-		      (void *)start, len, (void *)(p->buf + p->buf_size),
-		      func, line);
-		return;
-	}
 
 	if (len > UINT16_MAX) {
 		trace("add packet length %zu, %s:%i", len, func, line);
 		return;
 	}
 
-#if UINTPTR_MAX == UINT64_MAX
-	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
-		trace("add packet start %p, buffer start %p, %s:%i",
-		      (void *)start, (void *)p->buf, func, line);
-		return;
-	}
-#endif
-
-	p->pkt[idx].offset = start - p->buf;
-	p->pkt[idx].len = len;
+	p->pkt[idx].iov_base = (void *)start;
+	p->pkt[idx].iov_len = len;
 
 	p->count++;
 }
@@ -96,36 +107,31 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (len > UINT16_MAX || len + offset > UINT32_MAX) {
+	if (len > UINT16_MAX) {
 		if (func) {
-			trace("packet data length %zu, offset %zu, %s:%i",
-			      len, offset, func, line);
+			trace("packet data length %zu, %s:%i",
+			      len, func, line);
 		}
 		return NULL;
 	}
 
-	if (p->pkt[idx].offset + len + offset > p->buf_size) {
+	if (len + offset > p->pkt[idx].iov_len) {
 		if (func) {
-			trace("packet offset plus length %zu from size %zu, "
-			      "%s:%i", p->pkt[idx].offset + len + offset,
-			      p->buf_size, func, line);
+			trace("data length %zu, offset %zu from length %zu, "
+			      "%s:%i", len, offset, p->pkt[idx].iov_len,
+			      func, line);
 		}
 		return NULL;
 	}
 
-	if (len + offset > p->pkt[idx].len) {
-		if (func) {
-			trace("data length %zu, offset %zu from length %u, "
-			      "%s:%i", len, offset, p->pkt[idx].len,
-			      func, line);
-		}
+	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
+			       func, line))
 		return NULL;
-	}
 
 	if (left)
-		*left = p->pkt[idx].len - offset - len;
+		*left = p->pkt[idx].iov_len - offset - len;
 
-	return p->buf + p->pkt[idx].offset + offset;
+	return (char *)p->pkt[idx].iov_base + offset;
 }
 
 /**
diff --git a/packet.h b/packet.h
index a784b07bbed5..8377dcf678bb 100644
--- a/packet.h
+++ b/packet.h
@@ -6,16 +6,6 @@
 #ifndef PACKET_H
 #define PACKET_H
 
-/**
- * struct desc - Generic offset-based descriptor within buffer
- * @offset:	Offset of descriptor relative to buffer start, 32-bit limit
- * @len:	Length of descriptor, host order, 16-bit limit
- */
-struct desc {
-	uint32_t offset;
-	uint16_t len;
-};
-
 /**
  * struct pool - Generic pool of packets stored in a buffer
  * @buf:	Buffer storing packet descriptors
@@ -29,7 +19,7 @@ struct pool {
 	size_t buf_size;
 	size_t size;
 	size_t count;
-	struct desc pkt[1];
+	struct iovec pkt[1];
 };
 
 void packet_add_do(struct pool *p, size_t len, const char *start,
@@ -54,7 +44,7 @@ struct _name ## _t {							\
 	size_t buf_size;						\
 	size_t size;							\
 	size_t count;							\
-	struct desc pkt[_size];						\
+	struct iovec pkt[_size];					\
 }
 
 #define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size)			\
-- 
@@ -6,16 +6,6 @@
 #ifndef PACKET_H
 #define PACKET_H
 
-/**
- * struct desc - Generic offset-based descriptor within buffer
- * @offset:	Offset of descriptor relative to buffer start, 32-bit limit
- * @len:	Length of descriptor, host order, 16-bit limit
- */
-struct desc {
-	uint32_t offset;
-	uint16_t len;
-};
-
 /**
  * struct pool - Generic pool of packets stored in a buffer
  * @buf:	Buffer storing packet descriptors
@@ -29,7 +19,7 @@ struct pool {
 	size_t buf_size;
 	size_t size;
 	size_t count;
-	struct desc pkt[1];
+	struct iovec pkt[1];
 };
 
 void packet_add_do(struct pool *p, size_t len, const char *start,
@@ -54,7 +44,7 @@ struct _name ## _t {							\
 	size_t buf_size;						\
 	size_t size;							\
 	size_t count;							\
-	struct desc pkt[_size];						\
+	struct iovec pkt[_size];					\
 }
 
 #define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size)			\
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 2/8] vhost-user: introduce virtio API
  2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
  2024-10-10 12:28 ` [PATCH v8 1/8] packet: replace struct desc by struct iovec Laurent Vivier
@ 2024-10-10 12:28 ` Laurent Vivier
  2024-10-10 12:28 ` [PATCH v8 3/8] vhost-user: introduce vhost-user API Laurent Vivier
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:28 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile |   4 +-
 util.h   |   8 +
 virtio.c | 665 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 virtio.h | 183 +++++++++++++++
 4 files changed, 858 insertions(+), 2 deletions(-)
 create mode 100644 virtio.c
 create mode 100644 virtio.h

diff --git a/Makefile b/Makefile
index 74a95130d082..a2258891f104 100644
--- a/Makefile
+++ b/Makefile
@@ -54,7 +54,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
 	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c
+	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c virtio.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -64,7 +64,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
 	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	udp.h udp_flow.h util.h
+	udp.h udp_flow.h util.h virtio.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/util.h b/util.h
index 2c1e08e0c488..def145c997b7 100644
--- a/util.h
+++ b/util.h
@@ -131,6 +131,14 @@ static inline uint32_t ntohl_unaligned(const void *p)
 	return ntohl(val);
 }
 
+static inline void barrier(void) { __asm__ __volatile__("" ::: "memory"); }
+#define smp_mb()		do { barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); } while (0)
+#define smp_mb_release()	do { barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); } while (0)
+#define smp_mb_acquire()	do { barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); } while (0)
+
+#define smp_wmb()	smp_mb_release()
+#define smp_rmb()	smp_mb_acquire()
+
 #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 	     void *arg);
diff --git a/virtio.c b/virtio.c
new file mode 100644
index 000000000000..380590afbca3
--- /dev/null
+++ b/virtio.c
@@ -0,0 +1,665 @@
+// SPDX-License-Identifier: GPL-2.0-or-later AND BSD-3-Clause
+/*
+ * virtio API, vring and virtqueue functions definition
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+/* Some parts copied from QEMU subprojects/libvhost-user/libvhost-user.c
+ * originally licensed under the following terms:
+ *
+ * --
+ *
+ * Copyright IBM, Corp. 2007
+ * Copyright (c) 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Anthony Liguori <aliguori@us.ibm.com>
+ *  Marc-André Lureau <mlureau@redhat.com>
+ *  Victor Kaplansky <victork@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ *
+ * Some parts copied from QEMU hw/virtio/virtio.c
+ * licensed under the following terms:
+ *
+ * Copyright IBM, Corp. 2007
+ *
+ * Authors:
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * --
+ *
+ * virtq_used_event() and virtq_avail_event() from
+ * https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-712000A
+ * licensed under the following terms:
+ *
+ * --
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers.
+ *
+ * Copyright 2007, 2009, IBM Corporation
+ * Copyright 2011, Red Hat, Inc
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ‘‘AS IS’’ AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <stddef.h>
+#include <endian.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/eventfd.h>
+#include <sys/socket.h>
+
+#include "util.h"
+#include "virtio.h"
+
+#define VIRTQUEUE_MAX_SIZE 1024
+
+/**
+ * vu_gpa_to_va() - Translate guest physical address to our virtual address.
+ * @dev:	Vhost-user device
+ * @plen:	Physical length to map (input), capped to region (output)
+ * @guest_addr:	Guest physical address
+ *
+ * Return: virtual address in our address space of the guest physical address
+ */
+static void *vu_gpa_to_va(struct vu_dev *dev, uint64_t *plen, uint64_t guest_addr)
+{
+	unsigned int i;
+
+	if (*plen == 0)
+		return NULL;
+
+	/* Find matching memory region. */
+	for (i = 0; i < dev->nregions; i++) {
+		const struct vu_dev_region *r = &dev->regions[i];
+
+		if ((guest_addr >= r->gpa) &&
+		    (guest_addr < (r->gpa + r->size))) {
+			if ((guest_addr + *plen) > (r->gpa + r->size))
+				*plen = r->gpa + r->size - guest_addr;
+			/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+			return (void *)(guest_addr - r->gpa + r->mmap_addr +
+						     r->mmap_offset);
+		}
+	}
+
+	return NULL;
+}
+
+/**
+ * vring_avail_flags() - Read the available ring flags
+ * @vq:		Virtqueue
+ *
+ * Return: the available ring descriptor flags of the given virtqueue
+ */
+static inline uint16_t vring_avail_flags(const struct vu_virtq *vq)
+{
+	return le16toh(vq->vring.avail->flags);
+}
+
+/**
+ * vring_avail_idx() - Read the available ring index
+ * @vq:		Virtqueue
+ *
+ * Return: the available ring index of the given virtqueue
+ */
+static inline uint16_t vring_avail_idx(struct vu_virtq *vq)
+{
+	vq->shadow_avail_idx = le16toh(vq->vring.avail->idx);
+
+	return vq->shadow_avail_idx;
+}
+
+/**
+ * vring_avail_ring() - Read an available ring entry
+ * @vq:		Virtqueue
+ * @i:		Index of the entry to read
+ *
+ * Return: the ring entry content (head of the descriptor chain)
+ */
+static inline uint16_t vring_avail_ring(const struct vu_virtq *vq, int i)
+{
+	return le16toh(vq->vring.avail->ring[i]);
+}
+
+/**
+ * virtq_used_event - Get location of used event indices
+ *		      (only with VIRTIO_F_EVENT_IDX)
+ * @vq		Virtqueue
+ *
+ * Return: return the location of the used event index
+ */
+static inline uint16_t *virtq_used_event(const struct vu_virtq *vq)
+{
+        /* For backwards compat, used event index is at *end* of avail ring. */
+        return &vq->vring.avail->ring[vq->vring.num];
+}
+
+/**
+ * vring_get_used_event() - Get the used event from the available ring
+ * @vq		Virtqueue
+ *
+ * Return: the used event (available only if VIRTIO_RING_F_EVENT_IDX is set)
+ *         used_event is a performant alternative where the driver
+ *         specifies how far the device can progress before a notification
+ *         is required.
+ */
+static inline uint16_t vring_get_used_event(const struct vu_virtq *vq)
+{
+	return le16toh(*virtq_used_event(vq));
+}
+
+/**
+ * virtqueue_get_head() - Get the head of the descriptor chain for a given
+ *                        index
+ * @vq:		Virtqueue
+ * @idx:	Available ring entry index
+ * @head:	Head of the descriptor chain
+ */
+static void virtqueue_get_head(const struct vu_virtq *vq,
+			       unsigned int idx, unsigned int *head)
+{
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen.
+	 */
+	*head = vring_avail_ring(vq, idx % vq->vring.num);
+
+	/* If their number is silly, that's a fatal mistake. */
+	if (*head >= vq->vring.num)
+		die("vhost-user: Guest says index %u is available", *head);
+}
+
+/**
+ * virtqueue_read_indirect_desc() - Copy virtio ring descriptors from guest
+ *                                  memory
+ * @dev:	Vhost-user device
+ * @desc:	Destination address to copy the descriptors to
+ * @addr:	Guest memory address to copy from
+ * @len:	Length of memory to copy
+ *
+ * Return: -1 if there is an error, 0 otherwise
+ */
+static int virtqueue_read_indirect_desc(struct vu_dev *dev, struct vring_desc *desc,
+					uint64_t addr, size_t len)
+{
+	uint64_t read_len;
+
+	if (len > (VIRTQUEUE_MAX_SIZE * sizeof(struct vring_desc)))
+		return -1;
+
+	if (len == 0)
+		return -1;
+
+	while (len) {
+		const struct vring_desc *orig_desc;
+
+		read_len = len;
+		orig_desc = vu_gpa_to_va(dev, &read_len, addr);
+		if (!orig_desc)
+			return -1;
+
+		memcpy(desc, orig_desc, read_len);
+		len -= read_len;
+		addr += read_len;
+		desc += read_len / sizeof(struct vring_desc);
+	}
+
+	return 0;
+}
+
+/**
+ * enum virtqueue_read_desc_state - State in the descriptor chain
+ * @VIRTQUEUE_READ_DESC_ERROR	Found an invalid descriptor
+ * @VIRTQUEUE_READ_DESC_DONE	No more descriptors in the chain
+ * @VIRTQUEUE_READ_DESC_MORE	there are more descriptors in the chain
+ */
+enum virtqueue_read_desc_state {
+	VIRTQUEUE_READ_DESC_ERROR = -1,
+	VIRTQUEUE_READ_DESC_DONE = 0,   /* end of chain */
+	VIRTQUEUE_READ_DESC_MORE = 1,   /* more buffers in chain */
+};
+
+/**
+ * virtqueue_read_next_desc() - Read the the next descriptor in the chain
+ * @desc:	Virtio ring descriptors
+ * @i:		Index of the current descriptor
+ * @max:	Maximum value of the descriptor index
+ * @next:	Index of the next descriptor in the chain (output value)
+ *
+ * Return: current chain descriptor state (error, next, done)
+ */
+static int virtqueue_read_next_desc(const struct vring_desc *desc,
+				    int i, unsigned int max, unsigned int *next)
+{
+	/* If this descriptor says it doesn't chain, we're done. */
+	if (!(le16toh(desc[i].flags) & VRING_DESC_F_NEXT))
+		return VIRTQUEUE_READ_DESC_DONE;
+
+	/* Check they're not leading us off end of descriptors. */
+	*next = le16toh(desc[i].next);
+	/* Make sure compiler knows to grab that: we don't want it changing! */
+	smp_wmb();
+
+	if (*next >= max)
+		return VIRTQUEUE_READ_DESC_ERROR;
+
+	return VIRTQUEUE_READ_DESC_MORE;
+}
+
+/**
+ * vu_queue_empty() - Check if virtqueue is empty
+ * @vq:		Virtqueue
+ *
+ * Return: true if the virtqueue is empty, false otherwise
+ */
+bool vu_queue_empty(struct vu_virtq *vq)
+{
+	if (!vq->vring.avail)
+		return true;
+
+	if (vq->shadow_avail_idx != vq->last_avail_idx)
+		return false;
+
+	return vring_avail_idx(vq) == vq->last_avail_idx;
+}
+
+/**
+ * vring_can_notify() - Check if a notification can be sent
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ *
+ * Return: true if notification can be sent
+ */
+static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
+{
+	uint16_t old, new;
+	bool v;
+
+	/* We need to expose used array entries before checking used event. */
+	smp_mb();
+
+	/* Always notify when queue is empty (when feature acknowledge) */
+	if (vu_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
+	    !vq->inuse && vu_queue_empty(vq))
+		return true;
+
+	if (!vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX))
+		return !(vring_avail_flags(vq) & VRING_AVAIL_F_NO_INTERRUPT);
+
+	v = vq->signalled_used_valid;
+	vq->signalled_used_valid = true;
+	old = vq->signalled_used;
+	new = vq->signalled_used = vq->used_idx;
+	return !v || vring_need_event(vring_get_used_event(vq), new, old);
+}
+
+/**
+ * vu_queue_notify() - Send a notification to the given virtqueue
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
+{
+	if (!vq->vring.avail)
+		return;
+
+	if (!vring_can_notify(dev, vq)) {
+		debug("vhost-user: virtqueue can skip notify...");
+		return;
+	}
+
+	if (eventfd_write(vq->call_fd, 1) < 0)
+		die_perror("Error writing vhost-user queue eventfd");
+}
+
+/* virtq_avail_event() -  Get location of available event indices
+ *			      (only with VIRTIO_F_EVENT_IDX)
+ * @vq:		Virtqueue
+ *
+ * Return: return the location of the available event index
+ */
+static inline uint16_t *virtq_avail_event(const struct vu_virtq *vq)
+{
+        /* For backwards compat, avail event index is at *end* of used ring. */
+        return (uint16_t *)&vq->vring.used->ring[vq->vring.num];
+}
+
+/**
+ * vring_set_avail_event() - Set avail_event
+ * @vq:		Virtqueue
+ * @val:	Value to set to avail_event
+ *		avail_event is used in the same way the used_event is in the
+ *		avail_ring.
+ *		avail_event is used to advise the driver that notifications
+ *		are unnecessary until the driver writes entry with an index
+ *		specified by avail_event into the available ring.
+ */
+static inline void vring_set_avail_event(const struct vu_virtq *vq,
+					 uint16_t val)
+{
+	uint16_t val_le = htole16(val);
+
+	if (!vq->notification)
+		return;
+
+	memcpy(virtq_avail_event(vq), &val_le, sizeof(val_le));
+}
+
+/**
+ * virtqueue_map_desc() - Translate descriptor ring physical address into our
+ * 			  virtual address space
+ * @dev:	Vhost-user device
+ * @p_num_sg:	First iov entry to use (input),
+ *		first iov entry not used (output)
+ * @iov:	Iov array to use to store buffer virtual addresses
+ * @max_num_sg:	Maximum number of iov entries
+ * @pa:		Guest physical address of the buffer to map into our virtual
+ * 		address
+ * @sz:		Size of the buffer
+ *
+ * Return: false on error, true otherwise
+ */
+static bool virtqueue_map_desc(struct vu_dev *dev,
+			       unsigned int *p_num_sg, struct iovec *iov,
+			       unsigned int max_num_sg,
+			       uint64_t pa, size_t sz)
+{
+	unsigned int num_sg = *p_num_sg;
+
+	ASSERT(num_sg < max_num_sg);
+	ASSERT(sz);
+
+	while (sz) {
+		uint64_t len = sz;
+
+		iov[num_sg].iov_base = vu_gpa_to_va(dev, &len, pa);
+		if (iov[num_sg].iov_base == NULL)
+			die("vhost-user: invalid address for buffers");
+		iov[num_sg].iov_len = len;
+		num_sg++;
+		sz -= len;
+		pa += len;
+	}
+
+	*p_num_sg = num_sg;
+	return true;
+}
+
+/**
+ * vu_queue_map_desc - Map the virtqueue descriptor ring into our virtual
+ * 		       address space
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ * @idx:	First descriptor ring entry to map
+ * @elem:	Virtqueue element to store descriptor ring iov
+ *
+ * Return: -1 if there is an error, 0 otherwise
+ */
+static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned int idx,
+			     struct vu_virtq_element *elem)
+{
+	const struct vring_desc *desc = vq->vring.desc;
+	struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
+	unsigned int out_num = 0, in_num = 0;
+	unsigned int max = vq->vring.num;
+	unsigned int i = idx;
+	uint64_t read_len;
+	int rc;
+
+	if (le16toh(desc[i].flags) & VRING_DESC_F_INDIRECT) {
+		unsigned int desc_len;
+		uint64_t desc_addr;
+
+		if (le32toh(desc[i].len) % sizeof(struct vring_desc))
+			die("vhost-user: Invalid size for indirect buffer table");
+
+		/* loop over the indirect descriptor table */
+		desc_addr = le64toh(desc[i].addr);
+		desc_len = le32toh(desc[i].len);
+		max = desc_len / sizeof(struct vring_desc);
+		read_len = desc_len;
+		desc = vu_gpa_to_va(dev, &read_len, desc_addr);
+		if (desc && read_len != desc_len) {
+			/* Failed to use zero copy */
+			desc = NULL;
+			if (!virtqueue_read_indirect_desc(dev, desc_buf, desc_addr, desc_len))
+				desc = desc_buf;
+		}
+		if (!desc)
+			die("vhost-user: Invalid indirect buffer table");
+		i = 0;
+	}
+
+	/* Collect all the descriptors */
+	do {
+		if (le16toh(desc[i].flags) & VRING_DESC_F_WRITE) {
+			if (!virtqueue_map_desc(dev, &in_num, elem->in_sg,
+						elem->in_num,
+						le64toh(desc[i].addr),
+						le32toh(desc[i].len)))
+				return -1;
+		} else {
+			if (in_num)
+				die("Incorrect order for descriptors");
+			if (!virtqueue_map_desc(dev, &out_num, elem->out_sg,
+						elem->out_num,
+						le64toh(desc[i].addr),
+						le32toh(desc[i].len))) {
+				return -1;
+			}
+		}
+
+		/* If we've got too many, that implies a descriptor loop. */
+		if ((in_num + out_num) > max)
+			die("vhost-user: Loop in queue descriptor list");
+		rc = virtqueue_read_next_desc(desc, i, max, &i);
+	} while (rc == VIRTQUEUE_READ_DESC_MORE);
+
+	if (rc == VIRTQUEUE_READ_DESC_ERROR)
+		die("vhost-user: Failed to read descriptor list");
+
+	elem->index = idx;
+	elem->in_num = in_num;
+	elem->out_num = out_num;
+
+	return 0;
+}
+
+/**
+ * vu_queue_pop() - Pop an entry from the virtqueue
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ * @elem:	Virtqueue element to file with the entry information
+ *
+ * Return: -1 if there is an error, 0 otherwise
+ */
+/* cppcheck-suppress unusedFunction */
+int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
+{
+	unsigned int head;
+	int ret;
+
+	if (!vq->vring.avail)
+		return -1;
+
+	if (vu_queue_empty(vq))
+		return -1;
+
+	/* Needed after vu_queue_empty(), see comment in
+	 * virtqueue_num_heads().
+	 */
+	smp_rmb();
+
+	if (vq->inuse >= vq->vring.num)
+		die("vhost-user queue size exceeded");
+
+	virtqueue_get_head(vq, vq->last_avail_idx++, &head);
+
+	if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX))
+		vring_set_avail_event(vq, vq->last_avail_idx);
+
+	ret = vu_queue_map_desc(dev, vq, head, elem);
+
+	if (ret < 0)
+		return ret;
+
+	vq->inuse++;
+
+	return 0;
+}
+
+/**
+ * vu_queue_detach_element() - Detach an element from the virqueue
+ * @vq:		Virtqueue
+ */
+void vu_queue_detach_element(struct vu_virtq *vq)
+{
+	vq->inuse--;
+	/* unmap, when DMA support is added */
+}
+
+/**
+ * vu_queue_unpop() - Push back the previously popped element from the virqueue
+ * @vq:		Virtqueue
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_queue_unpop(struct vu_virtq *vq)
+{
+	vq->last_avail_idx--;
+	vu_queue_detach_element(vq);
+}
+
+/**
+ * vu_queue_rewind() - Push back a given number of popped elements
+ * @vq:		Virtqueue
+ * @num:	Number of element to unpop
+ */
+/* cppcheck-suppress unusedFunction */
+bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
+{
+	if (num > vq->inuse)
+		return false;
+
+	vq->last_avail_idx -= num;
+	vq->inuse -= num;
+	return true;
+}
+
+/**
+ * vring_used_write() - Write an entry in the used ring
+ * @vq:		Virtqueue
+ * @uelem:	Entry to write
+ * @i:		Index of the entry in the used ring
+ */
+static inline void vring_used_write(struct vu_virtq *vq,
+				    const struct vring_used_elem *uelem, int i)
+{
+	struct vring_used *used = vq->vring.used;
+
+	used->ring[i] = *uelem;
+}
+
+/**
+ * vu_queue_fill_by_index() - Update information of a descriptor ring entry
+ *			      in the used ring
+ * @vq:		Virtqueue
+ * @index:	Descriptor ring index
+ * @len:	Size of the element
+ * @idx:	Used ring entry index
+ */
+void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
+			    unsigned int len, unsigned int idx)
+{
+	struct vring_used_elem uelem;
+
+	if (!vq->vring.avail)
+		return;
+
+	idx = (idx + vq->used_idx) % vq->vring.num;
+
+	uelem.id = htole32(index);
+	uelem.len = htole32(len);
+	vring_used_write(vq, &uelem, idx);
+}
+
+/**
+ * vu_queue_fill() - Update information of a given element in the used ring
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ * @elem:	Element information to fill
+ * @len:	Size of the element
+ * @idx:	Used ring entry index
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
+		   unsigned int len, unsigned int idx)
+{
+	vu_queue_fill_by_index(vq, elem->index, len, idx);
+}
+
+/**
+ * vring_used_idx_set() - Set the descriptor ring current index
+ * @vq:		Virtqueue
+ * @val:	Value to set in the index
+ */
+static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
+{
+	vq->vring.used->idx = htole16(val);
+
+	vq->used_idx = val;
+}
+
+/**
+ * vu_queue_flush() - Flush the virtqueue
+ * @vq:		Virtqueue
+ * @count:	Number of entry to flush
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
+{
+	uint16_t old, new;
+
+	if (!vq->vring.avail)
+		return;
+
+	/* Make sure buffer is written before we update index. */
+	smp_wmb();
+
+	old = vq->used_idx;
+	new = old + count;
+	vring_used_idx_set(vq, new);
+	vq->inuse -= count;
+	if ((uint16_t)(new - vq->signalled_used) < (uint16_t)(new - old))
+		vq->signalled_used_valid = false;
+}
diff --git a/virtio.h b/virtio.h
new file mode 100644
index 000000000000..94efeb049fbc
--- /dev/null
+++ b/virtio.h
@@ -0,0 +1,183 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * virtio API, vring and virtqueue functions definition
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef VIRTIO_H
+#define VIRTIO_H
+
+#include <stdbool.h>
+#include <linux/vhost_types.h>
+
+/* Maximum size of a virtqueue */
+#define VIRTQUEUE_MAX_SIZE 1024
+
+/**
+ * struct vu_ring - Virtqueue rings
+ * @num:		Size of the queue
+ * @desc:		Descriptor ring
+ * @avail:		Available ring
+ * @used:		Used ring
+ * @log_guest_addr:	Guest address for logging
+ * @flags:		Vring flags
+ * 			VHOST_VRING_F_LOG is set if log address is valid
+ */
+struct vu_ring {
+	unsigned int num;
+	struct vring_desc *desc;
+	struct vring_avail *avail;
+	struct vring_used *used;
+	uint64_t log_guest_addr;
+	uint32_t flags;
+};
+
+/**
+ * struct vu_virtq - Virtqueue definition
+ * @vring:			Virtqueue rings
+ * @last_avail_idx:		Next head to pop
+ * @shadow_avail_idx:		Last avail_idx read from VQ.
+ * @used_idx:			Descriptor ring current index
+ * @signalled_used:		Last used index value we have signalled on
+ * @signalled_used_valid:	True if signalled_used if valid
+ * @notification:		True if the queues notify (via event
+ * 				index or interrupt)
+ * @inuse:			Number of entries in use
+ * @call_fd:			The event file descriptor to signal when
+ * 				buffers are used.
+ * @kick_fd:			The event file descriptor for adding
+ * 				buffers to the vring
+ * @err_fd:			The event file descriptor to signal when
+ * 				error occurs
+ * @enable:			True if the virtqueue is enabled
+ * @started:			True if the virtqueue is started
+ * @vra:			QEMU address of our rings
+ */
+struct vu_virtq {
+	struct vu_ring vring;
+	uint16_t last_avail_idx;
+	uint16_t shadow_avail_idx;
+	uint16_t used_idx;
+	uint16_t signalled_used;
+	bool signalled_used_valid;
+	bool notification;
+	unsigned int inuse;
+	int call_fd;
+	int kick_fd;
+	int err_fd;
+	unsigned int enable;
+	bool started;
+	struct vhost_vring_addr vra;
+};
+
+/**
+ * struct vu_dev_region - guest shared memory region
+ * @gpa:		Guest physical address of the region
+ * @size:		Memory size in bytes
+ * @qva:		QEMU virtual address
+ * @mmap_offset:	Offset where the region starts in the mapped memory
+ * @mmap_addr:		Address of the mapped memory
+ */
+struct vu_dev_region {
+	uint64_t gpa;
+	uint64_t size;
+	uint64_t qva;
+	uint64_t mmap_offset;
+	uint64_t mmap_addr;
+};
+
+#define VHOST_USER_MAX_QUEUES 2
+
+/*
+ * Set a reasonable maximum number of ram slots, which will be supported by
+ * any architecture.
+ */
+#define VHOST_USER_MAX_RAM_SLOTS 32
+
+/**
+ * struct vu_dev - vhost-user device information
+ * @context:		Execution context
+ * @nregions:		Number of shared memory regions
+ * @regions:		Guest shared memory regions
+ * @features:		Vhost-user features
+ * @protocol_features:	Vhost-user protocol features
+ */
+struct vu_dev {
+	uint32_t nregions;
+	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
+	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
+	uint64_t features;
+	uint64_t protocol_features;
+};
+
+/**
+ * struct vu_virtq_element - virtqueue element
+ * @index:	Descriptor ring index
+ * @out_num:	Number of outgoing iovec buffers
+ * @in_num:	Number of incoming iovec buffers
+ * @in_sg:	Incoming iovec buffers
+ * @out_sg:	Outgoing iovec buffers
+ */
+struct vu_virtq_element {
+	unsigned int index;
+	unsigned int out_num;
+	unsigned int in_num;
+	struct iovec *in_sg;
+	struct iovec *out_sg;
+};
+
+/**
+ * has_feature() - Check a feature bit in a features set
+ * @features:	Features set
+ * @fb:		Feature bit to check
+ *
+ * Return:	True if the feature bit is set
+ */
+static inline bool has_feature(uint64_t features, unsigned int fbit)
+{
+	return !!(features & (1ULL << fbit));
+}
+
+/**
+ * vu_has_feature() - Check if a virtio-net feature is available
+ * @vdev:	Vhost-user device
+ * @bit:	Feature to check
+ *
+ * Return:	True if the feature is available
+ */
+static inline bool vu_has_feature(const struct vu_dev *vdev,
+				  unsigned int fbit)
+{
+	return has_feature(vdev->features, fbit);
+}
+
+/**
+ * vu_has_protocol_feature() - Check if a vhost-user feature is available
+ * @vdev:	Vhost-user device
+ * @bit:	Feature to check
+ *
+ * Return:	True if the feature is available
+ */
+/* cppcheck-suppress unusedFunction */
+static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
+					   unsigned int fbit)
+{
+	return has_feature(vdev->protocol_features, fbit);
+}
+
+bool vu_queue_empty(struct vu_virtq *vq);
+void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq);
+int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq,
+		 struct vu_virtq_element *elem);
+void vu_queue_detach_element(struct vu_virtq *vq);
+void vu_queue_unpop(struct vu_virtq *vq);
+bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num);
+void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
+			    unsigned int len, unsigned int idx);
+void vu_queue_fill(struct vu_virtq *vq,
+		   const struct vu_virtq_element *elem, unsigned int len,
+		   unsigned int idx);
+void vu_queue_flush(struct vu_virtq *vq, unsigned int count);
+#endif /* VIRTIO_H */
-- 
@@ -0,0 +1,183 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * virtio API, vring and virtqueue functions definition
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef VIRTIO_H
+#define VIRTIO_H
+
+#include <stdbool.h>
+#include <linux/vhost_types.h>
+
+/* Maximum size of a virtqueue */
+#define VIRTQUEUE_MAX_SIZE 1024
+
+/**
+ * struct vu_ring - Virtqueue rings
+ * @num:		Size of the queue
+ * @desc:		Descriptor ring
+ * @avail:		Available ring
+ * @used:		Used ring
+ * @log_guest_addr:	Guest address for logging
+ * @flags:		Vring flags
+ * 			VHOST_VRING_F_LOG is set if log address is valid
+ */
+struct vu_ring {
+	unsigned int num;
+	struct vring_desc *desc;
+	struct vring_avail *avail;
+	struct vring_used *used;
+	uint64_t log_guest_addr;
+	uint32_t flags;
+};
+
+/**
+ * struct vu_virtq - Virtqueue definition
+ * @vring:			Virtqueue rings
+ * @last_avail_idx:		Next head to pop
+ * @shadow_avail_idx:		Last avail_idx read from VQ.
+ * @used_idx:			Descriptor ring current index
+ * @signalled_used:		Last used index value we have signalled on
+ * @signalled_used_valid:	True if signalled_used if valid
+ * @notification:		True if the queues notify (via event
+ * 				index or interrupt)
+ * @inuse:			Number of entries in use
+ * @call_fd:			The event file descriptor to signal when
+ * 				buffers are used.
+ * @kick_fd:			The event file descriptor for adding
+ * 				buffers to the vring
+ * @err_fd:			The event file descriptor to signal when
+ * 				error occurs
+ * @enable:			True if the virtqueue is enabled
+ * @started:			True if the virtqueue is started
+ * @vra:			QEMU address of our rings
+ */
+struct vu_virtq {
+	struct vu_ring vring;
+	uint16_t last_avail_idx;
+	uint16_t shadow_avail_idx;
+	uint16_t used_idx;
+	uint16_t signalled_used;
+	bool signalled_used_valid;
+	bool notification;
+	unsigned int inuse;
+	int call_fd;
+	int kick_fd;
+	int err_fd;
+	unsigned int enable;
+	bool started;
+	struct vhost_vring_addr vra;
+};
+
+/**
+ * struct vu_dev_region - guest shared memory region
+ * @gpa:		Guest physical address of the region
+ * @size:		Memory size in bytes
+ * @qva:		QEMU virtual address
+ * @mmap_offset:	Offset where the region starts in the mapped memory
+ * @mmap_addr:		Address of the mapped memory
+ */
+struct vu_dev_region {
+	uint64_t gpa;
+	uint64_t size;
+	uint64_t qva;
+	uint64_t mmap_offset;
+	uint64_t mmap_addr;
+};
+
+#define VHOST_USER_MAX_QUEUES 2
+
+/*
+ * Set a reasonable maximum number of ram slots, which will be supported by
+ * any architecture.
+ */
+#define VHOST_USER_MAX_RAM_SLOTS 32
+
+/**
+ * struct vu_dev - vhost-user device information
+ * @context:		Execution context
+ * @nregions:		Number of shared memory regions
+ * @regions:		Guest shared memory regions
+ * @features:		Vhost-user features
+ * @protocol_features:	Vhost-user protocol features
+ */
+struct vu_dev {
+	uint32_t nregions;
+	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
+	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
+	uint64_t features;
+	uint64_t protocol_features;
+};
+
+/**
+ * struct vu_virtq_element - virtqueue element
+ * @index:	Descriptor ring index
+ * @out_num:	Number of outgoing iovec buffers
+ * @in_num:	Number of incoming iovec buffers
+ * @in_sg:	Incoming iovec buffers
+ * @out_sg:	Outgoing iovec buffers
+ */
+struct vu_virtq_element {
+	unsigned int index;
+	unsigned int out_num;
+	unsigned int in_num;
+	struct iovec *in_sg;
+	struct iovec *out_sg;
+};
+
+/**
+ * has_feature() - Check a feature bit in a features set
+ * @features:	Features set
+ * @fb:		Feature bit to check
+ *
+ * Return:	True if the feature bit is set
+ */
+static inline bool has_feature(uint64_t features, unsigned int fbit)
+{
+	return !!(features & (1ULL << fbit));
+}
+
+/**
+ * vu_has_feature() - Check if a virtio-net feature is available
+ * @vdev:	Vhost-user device
+ * @bit:	Feature to check
+ *
+ * Return:	True if the feature is available
+ */
+static inline bool vu_has_feature(const struct vu_dev *vdev,
+				  unsigned int fbit)
+{
+	return has_feature(vdev->features, fbit);
+}
+
+/**
+ * vu_has_protocol_feature() - Check if a vhost-user feature is available
+ * @vdev:	Vhost-user device
+ * @bit:	Feature to check
+ *
+ * Return:	True if the feature is available
+ */
+/* cppcheck-suppress unusedFunction */
+static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
+					   unsigned int fbit)
+{
+	return has_feature(vdev->protocol_features, fbit);
+}
+
+bool vu_queue_empty(struct vu_virtq *vq);
+void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq);
+int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq,
+		 struct vu_virtq_element *elem);
+void vu_queue_detach_element(struct vu_virtq *vq);
+void vu_queue_unpop(struct vu_virtq *vq);
+bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num);
+void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
+			    unsigned int len, unsigned int idx);
+void vu_queue_fill(struct vu_virtq *vq,
+		   const struct vu_virtq_element *elem, unsigned int len,
+		   unsigned int idx);
+void vu_queue_flush(struct vu_virtq *vq, unsigned int count);
+#endif /* VIRTIO_H */
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 3/8] vhost-user: introduce vhost-user API
  2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
  2024-10-10 12:28 ` [PATCH v8 1/8] packet: replace struct desc by struct iovec Laurent Vivier
  2024-10-10 12:28 ` [PATCH v8 2/8] vhost-user: introduce virtio API Laurent Vivier
@ 2024-10-10 12:28 ` Laurent Vivier
  2024-10-10 12:28 ` [PATCH v8 4/8] udp: Prepare udp.c to be shared with vhost-user Laurent Vivier
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:28 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Add vhost_user.c and vhost_user.h that define the functions needed
to implement vhost-user backend.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile     |   4 +-
 vhost_user.c | 970 +++++++++++++++++++++++++++++++++++++++++++++++++++
 vhost_user.h | 208 +++++++++++
 virtio.h     |   1 +
 4 files changed, 1181 insertions(+), 2 deletions(-)
 create mode 100644 vhost_user.c
 create mode 100644 vhost_user.h

diff --git a/Makefile b/Makefile
index a2258891f104..0e8ed60a0da1 100644
--- a/Makefile
+++ b/Makefile
@@ -54,7 +54,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
 	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c virtio.c
+	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -64,7 +64,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
 	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	udp.h udp_flow.h util.h virtio.h
+	udp.h udp_flow.h util.h vhost_user.h virtio.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/vhost_user.c b/vhost_user.c
new file mode 100644
index 000000000000..1e302926b8fe
--- /dev/null
+++ b/vhost_user.c
@@ -0,0 +1,970 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * vhost-user API, command management and virtio interface
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * Some parts from QEMU subprojects/libvhost-user/libvhost-user.c
+ * licensed under the following terms:
+ *
+ * Copyright IBM, Corp. 2007
+ * Copyright (c) 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Anthony Liguori <aliguori@us.ibm.com>
+ *  Marc-André Lureau <mlureau@redhat.com>
+ *  Victor Kaplansky <victork@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <inttypes.h>
+#include <time.h>
+#include <net/ethernet.h>
+#include <netinet/in.h>
+#include <sys/epoll.h>
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <linux/vhost_types.h>
+#include <linux/virtio_net.h>
+
+#include "util.h"
+#include "passt.h"
+#include "tap.h"
+#include "vhost_user.h"
+#include "pcap.h"
+
+/* vhost-user version we are compatible with */
+#define VHOST_USER_VERSION 1
+
+/**
+ * vu_print_capabilities() - print vhost-user capabilities
+ * 			     this is part of the vhost-user backend
+ * 			     convention.
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_print_capabilities(void)
+{
+	info("{");
+	info("  \"type\": \"net\"");
+	info("}");
+	exit(EXIT_SUCCESS);
+}
+
+/**
+ * vu_request_to_string() - convert a vhost-user request number to its name
+ * @req:	request number
+ *
+ * Return: the name of request number
+ */
+static const char *vu_request_to_string(unsigned int req)
+{
+	if (req < VHOST_USER_MAX) {
+#define REQ(req) [req] = #req
+		static const char * const vu_request_str[VHOST_USER_MAX] = {
+			REQ(VHOST_USER_NONE),
+			REQ(VHOST_USER_GET_FEATURES),
+			REQ(VHOST_USER_SET_FEATURES),
+			REQ(VHOST_USER_SET_OWNER),
+			REQ(VHOST_USER_RESET_OWNER),
+			REQ(VHOST_USER_SET_MEM_TABLE),
+			REQ(VHOST_USER_SET_LOG_BASE),
+			REQ(VHOST_USER_SET_LOG_FD),
+			REQ(VHOST_USER_SET_VRING_NUM),
+			REQ(VHOST_USER_SET_VRING_ADDR),
+			REQ(VHOST_USER_SET_VRING_BASE),
+			REQ(VHOST_USER_GET_VRING_BASE),
+			REQ(VHOST_USER_SET_VRING_KICK),
+			REQ(VHOST_USER_SET_VRING_CALL),
+			REQ(VHOST_USER_SET_VRING_ERR),
+			REQ(VHOST_USER_GET_PROTOCOL_FEATURES),
+			REQ(VHOST_USER_SET_PROTOCOL_FEATURES),
+			REQ(VHOST_USER_GET_QUEUE_NUM),
+			REQ(VHOST_USER_SET_VRING_ENABLE),
+			REQ(VHOST_USER_SEND_RARP),
+			REQ(VHOST_USER_NET_SET_MTU),
+			REQ(VHOST_USER_SET_BACKEND_REQ_FD),
+			REQ(VHOST_USER_IOTLB_MSG),
+			REQ(VHOST_USER_SET_VRING_ENDIAN),
+			REQ(VHOST_USER_GET_CONFIG),
+			REQ(VHOST_USER_SET_CONFIG),
+			REQ(VHOST_USER_POSTCOPY_ADVISE),
+			REQ(VHOST_USER_POSTCOPY_LISTEN),
+			REQ(VHOST_USER_POSTCOPY_END),
+			REQ(VHOST_USER_GET_INFLIGHT_FD),
+			REQ(VHOST_USER_SET_INFLIGHT_FD),
+			REQ(VHOST_USER_GPU_SET_SOCKET),
+			REQ(VHOST_USER_VRING_KICK),
+			REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
+			REQ(VHOST_USER_ADD_MEM_REG),
+			REQ(VHOST_USER_REM_MEM_REG),
+		};
+#undef REQ
+		return vu_request_str[req];
+	}
+
+	return "unknown";
+}
+
+/**
+ * qva_to_va() -  Translate front-end (QEMU) virtual address to our virtual
+ * 		  address
+ * @dev:		vhost-user device
+ * @qemu_addr:		front-end userspace address
+ *
+ * Return: the memory address in our process virtual address space.
+ */
+static void *qva_to_va(struct vu_dev *dev, uint64_t qemu_addr)
+{
+	unsigned int i;
+
+	/* Find matching memory region.  */
+	for (i = 0; i < dev->nregions; i++) {
+		const struct vu_dev_region *r = &dev->regions[i];
+
+		if ((qemu_addr >= r->qva) && (qemu_addr < (r->qva + r->size))) {
+			/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+			return (void *)(qemu_addr - r->qva + r->mmap_addr +
+					r->mmap_offset);
+		}
+	}
+
+	return NULL;
+}
+
+/**
+ * vmsg_close_fds() - Close all file descriptors of a given message
+ * @vmsg:	vhost-user message with the list of the file descriptors
+ */
+static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
+{
+	int i;
+
+	for (i = 0; i < vmsg->fd_num; i++)
+		close(vmsg->fds[i]);
+}
+
+/**
+ * vu_remove_watch() - Remove a file descriptor from our passt epoll
+ * 		       file descriptor
+ * @vdev:	vhost-user device
+ * @fd:		file descriptor to remove
+ */
+static void vu_remove_watch(const struct vu_dev *vdev, int fd)
+{
+	/* Placeholder to add passt related code */
+	(void)vdev;
+	(void)fd;
+}
+
+/**
+ * vmsg_set_reply_u64() - Set reply payload.u64 and clear request flags
+ * 			  and fd_num
+ * @vmsg:	vhost-user message
+ * @val:	64-bit value to reply
+ */
+static void vmsg_set_reply_u64(struct vhost_user_msg *vmsg, uint64_t val)
+{
+	vmsg->hdr.flags = 0; /* defaults will be set by vu_send_reply() */
+	vmsg->hdr.size = sizeof(vmsg->payload.u64);
+	vmsg->payload.u64 = val;
+	vmsg->fd_num = 0;
+}
+
+/**
+ * vu_message_read_default() - Read incoming vhost-user message from the
+ * 			       front-end
+ * @conn_fd:	vhost-user command socket
+ * @vmsg:	vhost-user message
+ *
+ * Return:  0 if recvmsg() has been interrupted or if there's no data to read,
+ *          1 if a message has been received
+ */
+static int vu_message_read_default(int conn_fd, struct vhost_user_msg *vmsg)
+{
+	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS *
+		     sizeof(int))] = { 0 };
+	struct iovec iov = {
+		.iov_base = (char *)vmsg,
+		.iov_len = VHOST_USER_HDR_SIZE,
+	};
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = control,
+		.msg_controllen = sizeof(control),
+	};
+	ssize_t ret, sz_payload;
+	struct cmsghdr *cmsg;
+
+	ret = recvmsg(conn_fd, &msg, MSG_DONTWAIT);
+	if (ret < 0) {
+		if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
+			return 0;
+		die_perror("vhost-user message receive (recvmsg)");
+	}
+
+	vmsg->fd_num = 0;
+	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
+	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+		if (cmsg->cmsg_level == SOL_SOCKET &&
+		    cmsg->cmsg_type == SCM_RIGHTS) {
+			size_t fd_size;
+
+			ASSERT(cmsg->cmsg_len >= CMSG_LEN(0));
+			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
+			ASSERT(fd_size <= sizeof(vmsg->fds));
+			vmsg->fd_num = fd_size / sizeof(int);
+			memcpy(vmsg->fds, CMSG_DATA(cmsg), fd_size);
+			break;
+		}
+	}
+
+	sz_payload = vmsg->hdr.size;
+	if ((size_t)sz_payload > sizeof(vmsg->payload)) {
+		die("vhost-user message request too big: %d,"
+			 " size: vmsg->size: %zd, "
+			 "while sizeof(vmsg->payload) = %zu",
+			 vmsg->hdr.request, sz_payload, sizeof(vmsg->payload));
+	}
+
+	if (sz_payload) {
+		do
+			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
+		while (ret < 0 && errno == EINTR);
+
+		if (ret < 0)
+			die_perror("vhost-user message receive");
+
+		if (ret == 0)
+			die("EOF on vhost-user message receive");
+
+		if (ret < sz_payload)
+			die("Short-read on vhost-user message receive");
+	}
+
+	return 1;
+}
+
+/**
+ * vu_message_write() - Send a message to the front-end
+ * @conn_fd:	vhost-user command socket
+ * @vmsg:	vhost-user message
+ *
+ * #syscalls:vu sendmsg
+ */
+static void vu_message_write(int conn_fd, struct vhost_user_msg *vmsg)
+{
+	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = { 0 };
+	struct iovec iov = {
+		.iov_base = (char *)vmsg,
+		.iov_len = VHOST_USER_HDR_SIZE + vmsg->hdr.size,
+	};
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = control,
+	};
+	int rc;
+
+	ASSERT(vmsg->fd_num <= VHOST_MEMORY_BASELINE_NREGIONS);
+	if (vmsg->fd_num > 0) {
+		size_t fdsize = vmsg->fd_num * sizeof(int);
+		struct cmsghdr *cmsg;
+
+		msg.msg_controllen = CMSG_SPACE(fdsize);
+		cmsg = CMSG_FIRSTHDR(&msg);
+		cmsg->cmsg_len = CMSG_LEN(fdsize);
+		cmsg->cmsg_level = SOL_SOCKET;
+		cmsg->cmsg_type = SCM_RIGHTS;
+		memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
+	}
+
+	do
+		rc = sendmsg(conn_fd, &msg, 0);
+	while (rc < 0 && errno == EINTR);
+
+	if (rc < 0)
+		die_perror("vhost-user message send");
+
+	if ((uint32_t)rc < VHOST_USER_HDR_SIZE + vmsg->hdr.size)
+		die("EOF on vhost-user message send");
+}
+
+/**
+ * vu_send_reply() - Update message flags and send it to front-end
+ * @conn_fd:	vhost-user command socket
+ * @vmsg:	vhost-user message
+ */
+static void vu_send_reply(int conn_fd, struct vhost_user_msg *msg)
+{
+	msg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
+	msg->hdr.flags |= VHOST_USER_VERSION;
+	msg->hdr.flags |= VHOST_USER_REPLY_MASK;
+
+	vu_message_write(conn_fd, msg);
+}
+
+/**
+ * vu_get_features_exec() - Provide back-end features bitmask to front-end
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: True as a reply is requested
+ */
+static bool vu_get_features_exec(struct vu_dev *vdev,
+				 struct vhost_user_msg *msg)
+{
+	uint64_t features =
+		1ULL << VIRTIO_F_VERSION_1 |
+		1ULL << VIRTIO_NET_F_MRG_RXBUF |
+		1ULL << VHOST_USER_F_PROTOCOL_FEATURES;
+
+	(void)vdev;
+
+	vmsg_set_reply_u64(msg, features);
+
+	debug("Sending back to guest u64: 0x%016"PRIx64, msg->payload.u64);
+
+	return true;
+}
+
+/**
+ * vu_set_enable_all_rings() - Enable/disable all the virtqueues
+ * @vdev:	vhost-user device
+ * @enable:	New virtqueues state
+ */
+static void vu_set_enable_all_rings(struct vu_dev *vdev, bool enable)
+{
+	uint16_t i;
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++)
+		vdev->vq[i].enable = enable;
+}
+
+/**
+ * vu_set_features_exec() - Enable features of the back-end
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_features_exec(struct vu_dev *vdev,
+				 struct vhost_user_msg *msg)
+{
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vdev->features = msg->payload.u64;
+	/* We only support devices conforming to VIRTIO 1.0 or
+	 * later
+	 */
+	if (!vu_has_feature(vdev, VIRTIO_F_VERSION_1))
+		die("virtio legacy devices aren't supported by passt");
+
+	if (!vu_has_feature(vdev, VHOST_USER_F_PROTOCOL_FEATURES))
+		vu_set_enable_all_rings(vdev, true);
+
+	return false;
+}
+
+/**
+ * vu_set_owner_exec() - Session start flag, do nothing in our case
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_owner_exec(struct vu_dev *vdev,
+			      struct vhost_user_msg *msg)
+{
+	(void)vdev;
+	(void)msg;
+
+	return false;
+}
+
+/**
+ * map_ring() - Convert ring front-end (QEMU) addresses to our process
+ * 		virtual address space.
+ * @vdev:	vhost-user device
+ * @vq:		Virtqueue
+ *
+ * Return: True if ring cannot be mapped to our address space
+ */
+static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
+{
+	vq->vring.desc = qva_to_va(vdev, vq->vra.desc_user_addr);
+	vq->vring.used = qva_to_va(vdev, vq->vra.used_user_addr);
+	vq->vring.avail = qva_to_va(vdev, vq->vra.avail_user_addr);
+
+	debug("Setting virtq addresses:");
+	debug("    vring_desc  at %p", (void *)vq->vring.desc);
+	debug("    vring_used  at %p", (void *)vq->vring.used);
+	debug("    vring_avail at %p", (void *)vq->vring.avail);
+
+	return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
+}
+
+/**
+ * vu_set_mem_table_exec() - Sets the memory map regions to be able to
+ * 			     translate the vring addresses.
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ *
+ * #syscalls:vu mmap munmap
+ */
+static bool vu_set_mem_table_exec(struct vu_dev *vdev,
+				  struct vhost_user_msg *msg)
+{
+	struct vhost_user_memory m = msg->payload.memory, *memory = &m;
+	unsigned int i;
+
+	for (i = 0; i < vdev->nregions; i++) {
+		struct vu_dev_region *r = &vdev->regions[i];
+		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+		void *mm = (void *)r->mmap_addr;
+
+		if (mm)
+			munmap(mm, r->size + r->mmap_offset);
+	}
+	vdev->nregions = memory->nregions;
+
+	debug("vhost-user nregions: %u", memory->nregions);
+	for (i = 0; i < vdev->nregions; i++) {
+		struct vhost_user_memory_region *msg_region = &memory->regions[i];
+		struct vu_dev_region *dev_region = &vdev->regions[i];
+		void *mmap_addr;
+
+		debug("vhost-user region %d", i);
+		debug("    guest_phys_addr: 0x%016"PRIx64,
+		      msg_region->guest_phys_addr);
+		debug("    memory_size:     0x%016"PRIx64,
+		      msg_region->memory_size);
+		debug("    userspace_addr   0x%016"PRIx64,
+		      msg_region->userspace_addr);
+		debug("    mmap_offset      0x%016"PRIx64,
+		      msg_region->mmap_offset);
+
+		dev_region->gpa = msg_region->guest_phys_addr;
+		dev_region->size = msg_region->memory_size;
+		dev_region->qva = msg_region->userspace_addr;
+		dev_region->mmap_offset = msg_region->mmap_offset;
+
+		/* We don't use offset argument of mmap() since the
+		 * mapped address has to be page aligned.
+		 */
+		mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
+				 PROT_READ | PROT_WRITE, MAP_SHARED |
+				 MAP_NORESERVE, msg->fds[i], 0);
+
+		if (mmap_addr == MAP_FAILED)
+			die_perror("vhost-user region mmap error");
+
+		dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
+		debug("    mmap_addr:       0x%016"PRIx64,
+		      dev_region->mmap_addr);
+
+		close(msg->fds[i]);
+	}
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		if (vdev->vq[i].vring.desc) {
+			if (map_ring(vdev, &vdev->vq[i]))
+				die("remapping queue %d during setmemtable", i);
+		}
+	}
+
+	return false;
+}
+
+/**
+ * vu_set_vring_num_exec() - Set the size of the queue (vring size)
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_vring_num_exec(struct vu_dev *vdev,
+				  struct vhost_user_msg *msg)
+{
+	unsigned int idx = msg->payload.state.index;
+	unsigned int num = msg->payload.state.num;
+
+	debug("State.index: %u", idx);
+	debug("State.num:   %u", num);
+	vdev->vq[idx].vring.num = num;
+
+	return false;
+}
+
+/**
+ * vu_set_vring_addr_exec() - Set the addresses of the vring
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	/* We need to copy the payload to vhost_vring_addr structure
+         * to access index because address of msg->payload.addr
+         * can be unaligned as it is packed.
+         */
+	struct vhost_vring_addr addr = msg->payload.addr;
+	struct vu_virtq *vq = &vdev->vq[addr.index];
+
+	debug("vhost_vring_addr:");
+	debug("    index:  %d", addr.index);
+	debug("    flags:  %d", addr.flags);
+	debug("    desc_user_addr:   0x%016" PRIx64,
+	      (uint64_t)addr.desc_user_addr);
+	debug("    used_user_addr:   0x%016" PRIx64,
+	      (uint64_t)addr.used_user_addr);
+	debug("    avail_user_addr:  0x%016" PRIx64,
+	      (uint64_t)addr.avail_user_addr);
+	debug("    log_guest_addr:   0x%016" PRIx64,
+	      (uint64_t)addr.log_guest_addr);
+
+	vq->vra = msg->payload.addr;
+	vq->vring.flags = addr.flags;
+	vq->vring.log_guest_addr = addr.log_guest_addr;
+
+	if (map_ring(vdev, vq))
+		die("Invalid vring_addr message");
+
+	vq->used_idx = le16toh(vq->vring.used->idx);
+
+	if (vq->last_avail_idx != vq->used_idx) {
+		debug("Last avail index != used index: %u != %u",
+		      vq->last_avail_idx, vq->used_idx);
+	}
+
+	return false;
+}
+/**
+ * vu_set_vring_base_exec() - Sets the next index to use for descriptors
+ * 			      in this vring
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_vring_base_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	unsigned int idx = msg->payload.state.index;
+	unsigned int num = msg->payload.state.num;
+
+	debug("State.index: %u", idx);
+	debug("State.num:   %u", num);
+	vdev->vq[idx].shadow_avail_idx = vdev->vq[idx].last_avail_idx = num;
+
+	return false;
+}
+
+/**
+ * vu_get_vring_base_exec() - Stops the vring and returns the current
+ * 			      descriptor index or indices
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: True as a reply is requested
+ */
+static bool vu_get_vring_base_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	unsigned int idx = msg->payload.state.index;
+
+	debug("State.index: %u", idx);
+	msg->payload.state.num = vdev->vq[idx].last_avail_idx;
+	msg->hdr.size = sizeof(msg->payload.state);
+
+	vdev->vq[idx].started = false;
+
+	if (vdev->vq[idx].call_fd != -1) {
+		close(vdev->vq[idx].call_fd);
+		vdev->vq[idx].call_fd = -1;
+	}
+	if (vdev->vq[idx].kick_fd != -1) {
+		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
+		close(vdev->vq[idx].kick_fd);
+		vdev->vq[idx].kick_fd = -1;
+	}
+
+	return true;
+}
+
+/**
+ * vu_set_watch() - Add a file descriptor to the passt epoll file descriptor
+ * @vdev:	vhost-user device
+ * @fd:		file descriptor to add
+ */
+static void vu_set_watch(const struct vu_dev *vdev, int fd)
+{
+	/* Placeholder to add passt related code */
+	(void)vdev;
+	(void)fd;
+}
+
+/**
+ * vu_check_queue_msg_file() - Check if a message is valid,
+ * 			       close fds if NOFD bit is set
+ * @vmsg:	vhost-user message
+ */
+static void vu_check_queue_msg_file(struct vhost_user_msg *msg)
+{
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+
+	if (idx >= VHOST_USER_MAX_QUEUES)
+		die("Invalid vhost-user queue index: %u", idx);
+
+	if (nofd) {
+		vmsg_close_fds(msg);
+		return;
+	}
+
+	if (msg->fd_num != 1)
+		die("Invalid fds in vhost-user request: %d", msg->hdr.request);
+}
+
+/**
+ * vu_set_vring_kick_exec() - Set the event file descriptor for adding buffers
+ * 			      to the vring
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vu_check_queue_msg_file(msg);
+
+	if (vdev->vq[idx].kick_fd != -1) {
+		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
+		close(vdev->vq[idx].kick_fd);
+		vdev->vq[idx].kick_fd = -1;
+	}
+
+	if (!nofd)
+		vdev->vq[idx].kick_fd = msg->fds[0];
+
+	debug("Got kick_fd: %d for vq: %d", vdev->vq[idx].kick_fd, idx);
+
+	vdev->vq[idx].started = true;
+
+	if (vdev->vq[idx].kick_fd != -1 && VHOST_USER_IS_QUEUE_TX(idx)) {
+		vu_set_watch(vdev, vdev->vq[idx].kick_fd);
+		debug("Waiting for kicks on fd: %d for vq: %d",
+		      vdev->vq[idx].kick_fd, idx);
+	}
+
+	return false;
+}
+
+/**
+ * vu_set_vring_call_exec() - Set the event file descriptor to signal when
+ * 			      buffers are used
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_vring_call_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vu_check_queue_msg_file(msg);
+
+	if (vdev->vq[idx].call_fd != -1) {
+		close(vdev->vq[idx].call_fd);
+		vdev->vq[idx].call_fd = -1;
+	}
+
+	if (!nofd)
+		vdev->vq[idx].call_fd = msg->fds[0];
+
+	/* in case of I/O hang after reconnecting */
+	if (vdev->vq[idx].call_fd != -1)
+		eventfd_write(msg->fds[0], 1);
+
+	debug("Got call_fd: %d for vq: %d", vdev->vq[idx].call_fd, idx);
+
+	return false;
+}
+
+/**
+ * vu_set_vring_err_exec() - Set the event file descriptor to signal when
+ * 			     error occurs
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_vring_err_exec(struct vu_dev *vdev,
+				  struct vhost_user_msg *msg)
+{
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vu_check_queue_msg_file(msg);
+
+	if (vdev->vq[idx].err_fd != -1) {
+		close(vdev->vq[idx].err_fd);
+		vdev->vq[idx].err_fd = -1;
+	}
+
+	if (!nofd)
+		vdev->vq[idx].err_fd = msg->fds[0];
+
+	return false;
+}
+
+/**
+ * vu_get_protocol_features_exec() - Provide the protocol (vhost-user) features
+ * 				     to the front-end
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: True as a reply is requested
+ */
+static bool vu_get_protocol_features_exec(struct vu_dev *vdev,
+					  struct vhost_user_msg *msg)
+{
+	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK;
+
+	(void)vdev;
+	vmsg_set_reply_u64(msg, features);
+
+	return true;
+}
+
+/**
+ * vu_set_protocol_features_exec() - Enable protocol (vhost-user) features
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_protocol_features_exec(struct vu_dev *vdev,
+					  struct vhost_user_msg *msg)
+{
+	uint64_t features = msg->payload.u64;
+
+	debug("u64: 0x%016"PRIx64, features);
+
+	vdev->protocol_features = msg->payload.u64;
+
+	return false;
+}
+
+/**
+ * vu_get_queue_num_exec() - Tell how many queues we support
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: True as a reply is requested
+ */
+static bool vu_get_queue_num_exec(struct vu_dev *vdev,
+				  struct vhost_user_msg *msg)
+{
+	(void)vdev;
+
+	vmsg_set_reply_u64(msg, VHOST_USER_MAX_QUEUES);
+
+	return true;
+}
+
+/**
+ * vu_set_vring_enable_exec() - Enable or disable corresponding vring
+ * @vdev:	vhost-user device
+ * @vmsg:	vhost-user message
+ *
+ * Return: False as no reply is requested
+ */
+static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
+				     struct vhost_user_msg *msg)
+{
+	unsigned int enable = msg->payload.state.num;
+	unsigned int idx = msg->payload.state.index;
+
+	debug("State.index:  %u", idx);
+	debug("State.enable: %u", enable);
+
+	if (idx >= VHOST_USER_MAX_QUEUES)
+		die("Invalid vring_enable index: %u", idx);
+
+	vdev->vq[idx].enable = enable;
+	return false;
+}
+
+/**
+ * vu_init() - Initialize vhost-user device structure
+ * @c:		execution context
+ * @vdev:	vhost-user device
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_init(struct ctx *c, struct vu_dev *vdev)
+{
+	int i;
+
+	vdev->context = c;
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		vdev->vq[i] = (struct vu_virtq){
+			.call_fd = -1,
+			.kick_fd = -1,
+			.err_fd = -1,
+			.notification = true,
+		};
+	}
+}
+
+/**
+ * vu_cleanup() - Reset vhost-user device
+ * @vdev:	vhost-user device
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_cleanup(struct vu_dev *vdev)
+{
+	unsigned int i;
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		struct vu_virtq *vq = &vdev->vq[i];
+
+		vq->started = false;
+		vq->notification = true;
+
+		if (vq->call_fd != -1) {
+			close(vq->call_fd);
+			vq->call_fd = -1;
+		}
+		if (vq->err_fd != -1) {
+			close(vq->err_fd);
+			vq->err_fd = -1;
+		}
+		if (vq->kick_fd != -1) {
+			vu_remove_watch(vdev, vq->kick_fd);
+			close(vq->kick_fd);
+			vq->kick_fd = -1;
+		}
+
+		vq->vring.desc = 0;
+		vq->vring.used = 0;
+		vq->vring.avail = 0;
+	}
+
+	for (i = 0; i < vdev->nregions; i++) {
+		const struct vu_dev_region *r = &vdev->regions[i];
+		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+		void *m = (void *)r->mmap_addr;
+
+		if (m)
+			munmap(m, r->size + r->mmap_offset);
+	}
+	vdev->nregions = 0;
+}
+
+/**
+ * vu_sock_reset() - Reset connection socket
+ * @vdev:	vhost-user device
+ */
+static void vu_sock_reset(struct vu_dev *vdev)
+{
+	/* Placeholder to add passt related code */
+	(void)vdev;
+}
+
+static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
+					struct vhost_user_msg *msg) = {
+	[VHOST_USER_GET_FEATURES]	   = vu_get_features_exec,
+	[VHOST_USER_SET_FEATURES]	   = vu_set_features_exec,
+	[VHOST_USER_GET_PROTOCOL_FEATURES] = vu_get_protocol_features_exec,
+	[VHOST_USER_SET_PROTOCOL_FEATURES] = vu_set_protocol_features_exec,
+	[VHOST_USER_GET_QUEUE_NUM]	   = vu_get_queue_num_exec,
+	[VHOST_USER_SET_OWNER]		   = vu_set_owner_exec,
+	[VHOST_USER_SET_MEM_TABLE]	   = vu_set_mem_table_exec,
+	[VHOST_USER_SET_VRING_NUM]	   = vu_set_vring_num_exec,
+	[VHOST_USER_SET_VRING_ADDR]	   = vu_set_vring_addr_exec,
+	[VHOST_USER_SET_VRING_BASE]	   = vu_set_vring_base_exec,
+	[VHOST_USER_GET_VRING_BASE]	   = vu_get_vring_base_exec,
+	[VHOST_USER_SET_VRING_KICK]	   = vu_set_vring_kick_exec,
+	[VHOST_USER_SET_VRING_CALL]	   = vu_set_vring_call_exec,
+	[VHOST_USER_SET_VRING_ERR]	   = vu_set_vring_err_exec,
+	[VHOST_USER_SET_VRING_ENABLE]	   = vu_set_vring_enable_exec,
+};
+
+/**
+ * vu_control_handler() - Handle control commands for vhost-user
+ * @vdev:	vhost-user device
+ * @fd:		vhost-user message socket
+ * @events:	epoll events
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
+{
+	struct vhost_user_msg msg = { 0 };
+	bool need_reply, reply_requested;
+	int ret;
+
+	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
+		vu_sock_reset(vdev);
+		return;
+	}
+
+	ret = vu_message_read_default(fd, &msg);
+	if (ret == 0) {
+		vu_sock_reset(vdev);
+		return;
+	}
+	debug("================ Vhost user message ================");
+	debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
+		msg.hdr.request);
+	debug("Flags:   0x%x", msg.hdr.flags);
+	debug("Size:    %u", msg.hdr.size);
+
+	need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
+
+	if (msg.hdr.request >= 0 && msg.hdr.request < VHOST_USER_MAX &&
+	    vu_handle[msg.hdr.request])
+		reply_requested = vu_handle[msg.hdr.request](vdev, &msg);
+	else
+		die("Unhandled request: %d", msg.hdr.request);
+
+	/* cppcheck-suppress legacyUninitvar */
+	if (!reply_requested && need_reply) {
+		msg.payload.u64 = 0;
+		msg.hdr.flags = 0;
+		msg.hdr.size = sizeof(msg.payload.u64);
+		msg.fd_num = 0;
+		reply_requested = true;
+	}
+
+	if (reply_requested)
+		vu_send_reply(fd, &msg);
+}
diff --git a/vhost_user.h b/vhost_user.h
new file mode 100644
index 000000000000..5af349ba58b8
--- /dev/null
+++ b/vhost_user.h
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * vhost-user API, command management and virtio interface
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+/* some parts from subprojects/libvhost-user/libvhost-user.h */
+
+#ifndef VHOST_USER_H
+#define VHOST_USER_H
+
+#include "virtio.h"
+#include "iov.h"
+
+#define VHOST_USER_F_PROTOCOL_FEATURES 30
+
+#define VHOST_MEMORY_BASELINE_NREGIONS 8
+
+/**
+ * enum vhost_user_protocol_feature - List of available vhost-user features
+ */
+enum vhost_user_protocol_feature {
+	VHOST_USER_PROTOCOL_F_MQ = 0,
+	VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
+	VHOST_USER_PROTOCOL_F_RARP = 2,
+	VHOST_USER_PROTOCOL_F_REPLY_ACK = 3,
+	VHOST_USER_PROTOCOL_F_NET_MTU = 4,
+	VHOST_USER_PROTOCOL_F_BACKEND_REQ = 5,
+	VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
+	VHOST_USER_PROTOCOL_F_CRYPTO_SESSION = 7,
+	VHOST_USER_PROTOCOL_F_PAGEFAULT = 8,
+	VHOST_USER_PROTOCOL_F_CONFIG = 9,
+	VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
+	VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+	VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
+	VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
+	VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
+
+	VHOST_USER_PROTOCOL_F_MAX
+};
+
+/**
+ * enum vhost_user_request - List of available vhost-user requests
+ */
+enum vhost_user_request {
+	VHOST_USER_NONE = 0,
+	VHOST_USER_GET_FEATURES = 1,
+	VHOST_USER_SET_FEATURES = 2,
+	VHOST_USER_SET_OWNER = 3,
+	VHOST_USER_RESET_OWNER = 4,
+	VHOST_USER_SET_MEM_TABLE = 5,
+	VHOST_USER_SET_LOG_BASE = 6,
+	VHOST_USER_SET_LOG_FD = 7,
+	VHOST_USER_SET_VRING_NUM = 8,
+	VHOST_USER_SET_VRING_ADDR = 9,
+	VHOST_USER_SET_VRING_BASE = 10,
+	VHOST_USER_GET_VRING_BASE = 11,
+	VHOST_USER_SET_VRING_KICK = 12,
+	VHOST_USER_SET_VRING_CALL = 13,
+	VHOST_USER_SET_VRING_ERR = 14,
+	VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+	VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+	VHOST_USER_GET_QUEUE_NUM = 17,
+	VHOST_USER_SET_VRING_ENABLE = 18,
+	VHOST_USER_SEND_RARP = 19,
+	VHOST_USER_NET_SET_MTU = 20,
+	VHOST_USER_SET_BACKEND_REQ_FD = 21,
+	VHOST_USER_IOTLB_MSG = 22,
+	VHOST_USER_SET_VRING_ENDIAN = 23,
+	VHOST_USER_GET_CONFIG = 24,
+	VHOST_USER_SET_CONFIG = 25,
+	VHOST_USER_CREATE_CRYPTO_SESSION = 26,
+	VHOST_USER_CLOSE_CRYPTO_SESSION = 27,
+	VHOST_USER_POSTCOPY_ADVISE  = 28,
+	VHOST_USER_POSTCOPY_LISTEN  = 29,
+	VHOST_USER_POSTCOPY_END     = 30,
+	VHOST_USER_GET_INFLIGHT_FD = 31,
+	VHOST_USER_SET_INFLIGHT_FD = 32,
+	VHOST_USER_GPU_SET_SOCKET = 33,
+	VHOST_USER_VRING_KICK = 35,
+	VHOST_USER_GET_MAX_MEM_SLOTS = 36,
+	VHOST_USER_ADD_MEM_REG = 37,
+	VHOST_USER_REM_MEM_REG = 38,
+	VHOST_USER_MAX
+};
+
+/**
+ * struct vhost_user_header - vhost-user message header
+ * @request:	Request type of the message
+ * @flags:	Request flags
+ * @size:	The following payload size
+ */
+struct vhost_user_header {
+	enum vhost_user_request request;
+
+#define VHOST_USER_VERSION_MASK     0x3
+#define VHOST_USER_REPLY_MASK       (0x1 << 2)
+#define VHOST_USER_NEED_REPLY_MASK  (0x1 << 3)
+	uint32_t flags;
+	uint32_t size;
+} __attribute__ ((__packed__));
+
+/**
+ * struct vhost_user_memory_region - Front-end shared memory region information
+ * @guest_phys_addr:	Guest physical address of the region
+ * @memory_size:	Memory size
+ * @userspace_addr:	front-end (QEMU) userspace address
+ * @mmap_offset:	region offset in the shared memory area
+ */
+struct vhost_user_memory_region {
+	uint64_t guest_phys_addr;
+	uint64_t memory_size;
+	uint64_t userspace_addr;
+	uint64_t mmap_offset;
+};
+
+/**
+ * struct vhost_user_memory - List of all the shared memory regions
+ * @nregions:	Number of memory regions
+ * @padding:	Padding
+ * @regions:	Memory regions list
+ */
+struct vhost_user_memory {
+	uint32_t nregions;
+	uint32_t padding;
+	struct vhost_user_memory_region regions[VHOST_MEMORY_BASELINE_NREGIONS];
+};
+
+/**
+ * union vhost_user_payload - vhost-user message payload
+ * @u64:		64-bit payload
+ * @state:		vring state payload
+ * @addr:		vring addresses payload
+ * vhost_user_memory:	Memory regions information payload
+ */
+union vhost_user_payload {
+#define VHOST_USER_VRING_IDX_MASK   0xff
+#define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
+	uint64_t u64;
+	struct vhost_vring_state state;
+	struct vhost_vring_addr addr;
+	struct vhost_user_memory memory;
+};
+
+/**
+ * struct vhost_user_msg - vhost-use message
+ * @hdr:		Message header
+ * @payload:		Message payload
+ * @fds:		File descriptors associated with the message
+ * 			in the ancillary data.
+ * 			(shared memory or event file descriptors)
+ * @fd_num:		Number of file descriptors
+ */
+struct vhost_user_msg {
+	struct vhost_user_header hdr;
+	union vhost_user_payload payload;
+
+	int fds[VHOST_MEMORY_BASELINE_NREGIONS];
+	int fd_num;
+} __attribute__ ((__packed__));
+#define VHOST_USER_HDR_SIZE sizeof(struct vhost_user_header)
+
+/* index of the RX virtqueue */
+#define VHOST_USER_RX_QUEUE 0
+/* index of the TX virtqueue */
+#define VHOST_USER_TX_QUEUE 1
+
+/* in case of multiqueue, the RX and TX queues are interleaved */
+#define VHOST_USER_IS_QUEUE_TX(n)	(n % 2)
+#define VHOST_USER_IS_QUEUE_RX(n)	(!(n % 2))
+
+/* Default virtio-net header for passt */
+#define VU_HEADER ((struct virtio_net_hdr){	\
+	.flags = VIRTIO_NET_HDR_F_DATA_VALID,	\
+	.gso_type = VIRTIO_NET_HDR_GSO_NONE,	\
+})
+
+/**
+ * vu_queue_enabled - Return state of a virtqueue
+ * @vq:		virtqueue to check
+ *
+ * Return: true if the virqueue is enabled, false otherwise
+ */
+/* cppcheck-suppress unusedFunction */
+static inline bool vu_queue_enabled(const struct vu_virtq *vq)
+{
+	return vq->enable;
+}
+
+/**
+ * vu_queue_started - Return state of a virtqueue
+ * @vq:		virtqueue to check
+ *
+ * Return: true if the virqueue is started, false otherwise
+ */
+/* cppcheck-suppress unusedFunction */
+static inline bool vu_queue_started(const struct vu_virtq *vq)
+{
+	return vq->started;
+}
+
+void vu_print_capabilities(void);
+void vu_init(struct ctx *c, struct vu_dev *vdev);
+void vu_cleanup(struct vu_dev *vdev);
+void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events);
+#endif /* VHOST_USER_H */
diff --git a/virtio.h b/virtio.h
index 94efeb049fbc..6410d60f9b3f 100644
--- a/virtio.h
+++ b/virtio.h
@@ -105,6 +105,7 @@ struct vu_dev_region {
  * @protocol_features:	Vhost-user protocol features
  */
 struct vu_dev {
+	struct ctx *context;
 	uint32_t nregions;
 	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
 	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
-- 
@@ -105,6 +105,7 @@ struct vu_dev_region {
  * @protocol_features:	Vhost-user protocol features
  */
 struct vu_dev {
+	struct ctx *context;
 	uint32_t nregions;
 	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
 	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 4/8] udp: Prepare udp.c to be shared with vhost-user
  2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
                   ` (2 preceding siblings ...)
  2024-10-10 12:28 ` [PATCH v8 3/8] vhost-user: introduce vhost-user API Laurent Vivier
@ 2024-10-10 12:28 ` Laurent Vivier
  2024-10-14  4:29   ` David Gibson
  2024-10-10 12:28 ` [PATCH v8 5/8] tcp: Export headers functions Laurent Vivier
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:28 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Export udp_payload_t, udp_update_hdr4(), udp_update_hdr6() and
udp_sock_errs().

Rename udp_listen_sock_handler() to udp_buf_listen_sock_handler() and
udp_reply_sock_handler to udp_buf_reply_sock_handler().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 udp.c          | 74 ++++++++++++++++++++++++++++++--------------------
 udp_internal.h | 34 +++++++++++++++++++++++
 2 files changed, 79 insertions(+), 29 deletions(-)
 create mode 100644 udp_internal.h

diff --git a/udp.c b/udp.c
index 100610f2472e..8fc5d8099310 100644
--- a/udp.c
+++ b/udp.c
@@ -109,8 +109,7 @@
 #include "pcap.h"
 #include "log.h"
 #include "flow_table.h"
-
-#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+#include "udp_internal.h"
 
 /* "Spliced" sockets indexed by bound port (host order) */
 static int udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
@@ -118,20 +117,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
 
 /* Static buffers */
 
-/**
- * struct udp_payload_t - UDP header and data for inbound messages
- * @uh:		UDP header
- * @data:	UDP data
- */
-static struct udp_payload_t {
-	struct udphdr uh;
-	char data[USHRT_MAX - sizeof(struct udphdr)];
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)))
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
-#endif
-udp_payload[UDP_MAX_FRAMES];
+/* UDP header and data for inbound messages */
+static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
 
 /* Ethernet header for IPv4 frames */
 static struct ethhdr udp4_eth_hdr;
@@ -302,9 +289,9 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
  *
  * Return: size of IPv4 payload (UDP header + data)
  */
-static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
-			      const struct flowside *toside, size_t dlen,
-			      bool no_udp_csum)
+size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen,
+		       bool no_udp_csum)
 {
 	const struct in_addr *src = inany_v4(&toside->oaddr);
 	const struct in_addr *dst = inany_v4(&toside->eaddr);
@@ -345,9 +332,9 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
  *
  * Return: size of IPv6 payload (UDP header + data)
  */
-static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
-			      const struct flowside *toside, size_t dlen,
-			      bool no_udp_csum)
+size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen,
+		       bool no_udp_csum)
 {
 	uint16_t l4len = dlen + sizeof(bp->uh);
 
@@ -477,7 +464,7 @@ static int udp_sock_recverr(int s)
  *
  * Return: Number of errors handled, or < 0 if we have an unrecoverable error
  */
-static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
+int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
 {
 	unsigned n_err = 0;
 	socklen_t errlen;
@@ -554,7 +541,7 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
 }
 
 /**
- * udp_listen_sock_handler() - Handle new data from socket
+ * udp_buf_listen_sock_handler() - Handle new data from socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -562,8 +549,9 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
  *
  * #syscalls recvmmsg
  */
-void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
-			     uint32_t events, const struct timespec *now)
+static void udp_buf_listen_sock_handler(const struct ctx *c,
+					union epoll_ref ref, uint32_t events,
+					const struct timespec *now)
 {
 	const socklen_t sasize = sizeof(udp_meta[0].s_in);
 	int n, i;
@@ -630,7 +618,21 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 }
 
 /**
- * udp_reply_sock_handler() - Handle new data from flow specific socket
+ * udp_listen_sock_handler() - Handle new data from socket
+ * @c:		Execution context
+ * @ref:	epoll reference
+ * @events:	epoll events bitmap
+ * @now:	Current timestamp
+ */
+void udp_listen_sock_handler(const struct ctx *c,
+			     union epoll_ref ref, uint32_t events,
+			     const struct timespec *now)
+{
+	udp_buf_listen_sock_handler(c, ref, events, now);
+}
+
+/**
+ * udp_buf_reply_sock_handler() - Handle new data from flow specific socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -638,8 +640,9 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
  *
  * #syscalls recvmmsg
  */
-void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			    uint32_t events, const struct timespec *now)
+static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+				       uint32_t events,
+				       const struct timespec *now)
 {
 	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
 	const struct flowside *toside = flowside_at_sidx(tosidx);
@@ -684,6 +687,19 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 	}
 }
 
+/**
+ * udp_reply_sock_handler() - Handle new data from flow specific socket
+ * @c:		Execution context
+ * @ref:	epoll reference
+ * @events:	epoll events bitmap
+ * @now:	Current timestamp
+ */
+void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+			    uint32_t events, const struct timespec *now)
+{
+	udp_buf_reply_sock_handler(c, ref, events, now);
+}
+
 /**
  * udp_tap_handler() - Handle packets from tap
  * @c:		Execution context
diff --git a/udp_internal.h b/udp_internal.h
new file mode 100644
index 000000000000..cc80e3055423
--- /dev/null
+++ b/udp_internal.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef UDP_INTERNAL_H
+#define UDP_INTERNAL_H
+
+#include "tap.h" /* needed by udp_meta_t */
+
+#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+
+/**
+ * struct udp_payload_t - UDP header and data for inbound messages
+ * @uh:		UDP header
+ * @data:	UDP data
+ */
+struct udp_payload_t {
+	struct udphdr uh;
+	char data[USHRT_MAX - sizeof(struct udphdr)];
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)));
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
+#endif
+
+size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen,
+		       bool no_udp_csum);
+size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
+                       const struct flowside *toside, size_t dlen,
+		       bool no_udp_csum);
+int udp_sock_errs(const struct ctx *c, int s, uint32_t events);
+#endif /* UDP_INTERNAL_H */
-- 
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef UDP_INTERNAL_H
+#define UDP_INTERNAL_H
+
+#include "tap.h" /* needed by udp_meta_t */
+
+#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+
+/**
+ * struct udp_payload_t - UDP header and data for inbound messages
+ * @uh:		UDP header
+ * @data:	UDP data
+ */
+struct udp_payload_t {
+	struct udphdr uh;
+	char data[USHRT_MAX - sizeof(struct udphdr)];
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)));
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
+#endif
+
+size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen,
+		       bool no_udp_csum);
+size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
+                       const struct flowside *toside, size_t dlen,
+		       bool no_udp_csum);
+int udp_sock_errs(const struct ctx *c, int s, uint32_t events);
+#endif /* UDP_INTERNAL_H */
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 5/8] tcp: Export headers functions
  2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
                   ` (3 preceding siblings ...)
  2024-10-10 12:28 ` [PATCH v8 4/8] udp: Prepare udp.c to be shared with vhost-user Laurent Vivier
@ 2024-10-10 12:28 ` Laurent Vivier
  2024-10-14  4:29   ` David Gibson
  2024-10-10 12:29 ` [PATCH v8 6/8] passt: rename tap_sock_init() to tap_backend_init() Laurent Vivier
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:28 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Export tcp_fill_headers[4|6]() and tcp_update_check_tcp[4|6]().

They'll be needed by vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 tcp.c          | 30 +++++++++++++++---------------
 tcp_internal.h | 15 +++++++++++++++
 2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/tcp.c b/tcp.c
index 9617b7ac2404..eae02b1647e3 100644
--- a/tcp.c
+++ b/tcp.c
@@ -761,9 +761,9 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
  * @iov_cnt:	Length of the array
  * @l4offset:	IPv4 payload offset in the iovec array
  */
-static void tcp_update_check_tcp4(const struct iphdr *iph,
-				  const struct iovec *iov, int iov_cnt,
-				  size_t l4offset)
+void tcp_update_check_tcp4(const struct iphdr *iph,
+			   const struct iovec *iov, int iov_cnt,
+			   size_t l4offset)
 {
 	uint16_t l4len = ntohs(iph->tot_len) - sizeof(struct iphdr);
 	struct in_addr saddr = { .s_addr = iph->saddr };
@@ -813,9 +813,9 @@ static void tcp_update_check_tcp4(const struct iphdr *iph,
  * @iov_cnt:	Length of the array
  * @l4offset:	IPv6 payload offset in the iovec array
  */
-static void tcp_update_check_tcp6(const struct ipv6hdr *ip6h,
-				  const struct iovec *iov, int iov_cnt,
-				  size_t l4offset)
+void tcp_update_check_tcp6(const struct ipv6hdr *ip6h,
+			   const struct iovec *iov, int iov_cnt,
+			   size_t l4offset)
 {
 	uint16_t l4len = ntohs(ip6h->payload_len);
 	size_t check_ofs;
@@ -982,11 +982,11 @@ static void tcp_fill_header(struct tcphdr *th,
  *
  * Return: The IPv4 payload length, host order
  */
-static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
-				struct tap_hdr *taph,
-				struct iphdr *iph, struct tcp_payload_t *bp,
-				size_t dlen, const uint16_t *check,
-				uint32_t seq, bool no_tcp_csum)
+size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
+			 struct tap_hdr *taph,
+			 struct iphdr *iph, struct tcp_payload_t *bp,
+			 size_t dlen, const uint16_t *check,
+			 uint32_t seq, bool no_tcp_csum)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
 	const struct in_addr *src4 = inany_v4(&tapside->oaddr);
@@ -1034,10 +1034,10 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
  *
  * Return: The IPv6 payload length, host order
  */
-static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
-				struct tap_hdr *taph,
-				struct ipv6hdr *ip6h, struct tcp_payload_t *bp,
-				size_t dlen, uint32_t seq, bool no_tcp_csum)
+size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
+			 struct tap_hdr *taph,
+			 struct ipv6hdr *ip6h, struct tcp_payload_t *bp,
+			 size_t dlen, uint32_t seq, bool no_tcp_csum)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
 	size_t l4len = dlen + sizeof(bp->th);
diff --git a/tcp_internal.h b/tcp_internal.h
index 2f74ffeff8f3..8e87f98b470f 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -118,6 +118,21 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
 		tcp_rst_do(c, conn);					\
 	} while (0)
 
+void tcp_update_check_tcp4(const struct iphdr *iph,
+			   const struct iovec *iov, int iov_cnt,
+			   size_t l4offset);
+void tcp_update_check_tcp6(const struct ipv6hdr *ip6h,
+			   const struct iovec *iov, int iov_cnt,
+			   size_t l4offset);
+size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
+			 struct tap_hdr *taph,
+			 struct iphdr *iph, struct tcp_payload_t *bp,
+			 size_t dlen, const uint16_t *check,
+			 uint32_t seq, bool no_tcp_csum);
+size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
+			 struct tap_hdr *taph,
+			 struct ipv6hdr *ip6h, struct tcp_payload_t *bp,
+			 size_t dlen, uint32_t seq, bool no_tcp_csum);
 size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 			       struct iovec *iov, size_t dlen,
 			       const uint16_t *check, uint32_t seq,
-- 
@@ -118,6 +118,21 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
 		tcp_rst_do(c, conn);					\
 	} while (0)
 
+void tcp_update_check_tcp4(const struct iphdr *iph,
+			   const struct iovec *iov, int iov_cnt,
+			   size_t l4offset);
+void tcp_update_check_tcp6(const struct ipv6hdr *ip6h,
+			   const struct iovec *iov, int iov_cnt,
+			   size_t l4offset);
+size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
+			 struct tap_hdr *taph,
+			 struct iphdr *iph, struct tcp_payload_t *bp,
+			 size_t dlen, const uint16_t *check,
+			 uint32_t seq, bool no_tcp_csum);
+size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
+			 struct tap_hdr *taph,
+			 struct ipv6hdr *ip6h, struct tcp_payload_t *bp,
+			 size_t dlen, uint32_t seq, bool no_tcp_csum);
 size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 			       struct iovec *iov, size_t dlen,
 			       const uint16_t *check, uint32_t seq,
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 6/8] passt: rename tap_sock_init() to tap_backend_init()
  2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
                   ` (4 preceding siblings ...)
  2024-10-10 12:28 ` [PATCH v8 5/8] tcp: Export headers functions Laurent Vivier
@ 2024-10-10 12:29 ` Laurent Vivier
  2024-10-14  4:30   ` David Gibson
  2024-10-14 22:38   ` Stefano Brivio
  2024-10-10 12:29 ` [PATCH v8 7/8] vhost-user: add vhost-user Laurent Vivier
  2024-10-10 12:29 ` [PATCH v8 8/8] test: Add tests for passt in vhost-user mode Laurent Vivier
  7 siblings, 2 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:29 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Extract pool storage initialization loop to tap_sock_update_pool(),
extract QEMU hints to tap_backend_show_hints().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 passt.c |  2 +-
 tap.c   | 56 +++++++++++++++++++++++++++++++++++++++++---------------
 tap.h   |  2 +-
 3 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/passt.c b/passt.c
index ad6f0bc32df6..79093ee02d62 100644
--- a/passt.c
+++ b/passt.c
@@ -261,7 +261,7 @@ int main(int argc, char **argv)
 
 	pasta_netns_quit_init(&c);
 
-	tap_sock_init(&c);
+	tap_backend_init(&c);
 
 	secret_init(&c);
 
diff --git a/tap.c b/tap.c
index c53a39b79e62..4b826fdf7adc 100644
--- a/tap.c
+++ b/tap.c
@@ -1188,11 +1188,31 @@ int tap_sock_unix_open(char *sock_path)
 	return fd;
 }
 
+/**
+ * tap_backend_show_hints() - Give help information to start QEMU
+ * @c:		Execution context
+ */
+static void tap_backend_show_hints(struct ctx *c)
+{
+	switch(c->mode) {
+	case MODE_PASTA:
+		/* No hints */
+		break;
+	case MODE_PASST:
+		info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
+		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
+		     c->sock_path);
+		info("or qrap, for earlier qemu versions:");
+		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
+		break;
+	}
+}
+
 /**
  * tap_sock_unix_init() - Start listening for connections on AF_UNIX socket
  * @c:		Execution context
  */
-static void tap_sock_unix_init(struct ctx *c)
+static void tap_sock_unix_init(const struct ctx *c)
 {
 	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
 	struct epoll_event ev = { 0 };
@@ -1203,12 +1223,6 @@ static void tap_sock_unix_init(struct ctx *c)
 	ev.events = EPOLLIN | EPOLLET;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
-
-	info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
-	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
-	     c->sock_path);
-	info("or qrap, for earlier qemu versions:");
-	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
 }
 
 /**
@@ -1321,21 +1335,31 @@ static void tap_sock_tun_init(struct ctx *c)
 }
 
 /**
- * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
- * @c:		Execution context
+ * tap_sock_update_pool() - Set the buffer base and size for the pool of packets
+ * @base:	Buffer base
+ * @size	Buffer size
  */
-void tap_sock_init(struct ctx *c)
+static void tap_sock_update_pool(void *base, size_t size)
 {
-	size_t sz = sizeof(pkt_buf);
 	int i;
 
-	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
-	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
+	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, base, size);
+	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, base, size);
 
 	for (i = 0; i < TAP_SEQS; i++) {
-		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
-		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
+		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
+		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
 	}
+}
+
+/**
+ * tap_backend_init() - Create and set up AF_UNIX socket or
+ *			tuntap file descriptor
+ * @c:		Execution context
+ */
+void tap_backend_init(struct ctx *c)
+{
+	tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
 
 	if (c->fd_tap != -1) { /* Passed as --fd */
 		struct epoll_event ev = { 0 };
@@ -1365,4 +1389,6 @@ void tap_sock_init(struct ctx *c)
 		 */
 		memset(&c->guest_mac, 0xff, sizeof(c->guest_mac));
 	}
+
+	tap_backend_show_hints(c);
 }
diff --git a/tap.h b/tap.h
index 85f1e8473711..8728cc5c09c3 100644
--- a/tap.h
+++ b/tap.h
@@ -68,7 +68,7 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 int tap_sock_unix_open(char *sock_path);
-void tap_sock_init(struct ctx *c);
+void tap_backend_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
 void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
-- 
@@ -68,7 +68,7 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 int tap_sock_unix_open(char *sock_path);
-void tap_sock_init(struct ctx *c);
+void tap_backend_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
 void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
                   ` (5 preceding siblings ...)
  2024-10-10 12:29 ` [PATCH v8 6/8] passt: rename tap_sock_init() to tap_backend_init() Laurent Vivier
@ 2024-10-10 12:29 ` Laurent Vivier
  2024-10-15  3:23   ` David Gibson
                     ` (2 more replies)
  2024-10-10 12:29 ` [PATCH v8 8/8] test: Add tests for passt in vhost-user mode Laurent Vivier
  7 siblings, 3 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:29 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile     |   6 +-
 conf.c       |  21 ++-
 epoll_type.h |   4 +
 iov.c        |   1 -
 isolation.c  |  15 +-
 packet.c     |  11 ++
 packet.h     |   8 +-
 passt.1      |  10 +-
 passt.c      |   9 +
 passt.h      |   6 +
 pcap.c       |   1 -
 tap.c        |  80 +++++++--
 tap.h        |   5 +-
 tcp.c        |   7 +
 tcp_vu.c     | 476 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_vu.h     |  12 ++
 udp.c        |  10 ++
 udp_vu.c     | 336 ++++++++++++++++++++++++++++++++++++
 udp_vu.h     |  13 ++
 vhost_user.c |  37 ++--
 vhost_user.h |   4 +-
 virtio.c     |   5 -
 vu_common.c  | 385 +++++++++++++++++++++++++++++++++++++++++
 vu_common.h  |  47 +++++
 24 files changed, 1454 insertions(+), 55 deletions(-)
 create mode 100644 tcp_vu.c
 create mode 100644 tcp_vu.h
 create mode 100644 udp_vu.c
 create mode 100644 udp_vu.h
 create mode 100644 vu_common.c
 create mode 100644 vu_common.h

diff --git a/Makefile b/Makefile
index 0e8ed60a0da1..1e8910dda1f4 100644
--- a/Makefile
+++ b/Makefile
@@ -54,7 +54,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
 	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
+	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -64,7 +65,8 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
 	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	udp.h udp_flow.h util.h vhost_user.h virtio.h
+	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
+	virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/conf.c b/conf.c
index c63101970155..29d6e41f5770 100644
--- a/conf.c
+++ b/conf.c
@@ -45,6 +45,7 @@
 #include "lineread.h"
 #include "isolation.h"
 #include "log.h"
+#include "vhost_user.h"
 
 /**
  * next_chunk - Return the next piece of a string delimited by a character
@@ -762,9 +763,14 @@ static void usage(const char *name, FILE *f, int status)
 			"    default: same interface name as external one\n");
 	} else {
 		fprintf(f,
-			"  -s, --socket PATH	UNIX domain socket path\n"
+			"  -s, --socket, --socket-path PATH	UNIX domain socket path\n"
 			"    default: probe free path starting from "
 			UNIX_SOCK_PATH "\n", 1);
+		fprintf(f,
+			"  --vhost-user		Enable vhost-user mode\n"
+			"    UNIX domain socket is provided by -s option\n"
+			"  --print-capabilities	print back-end capabilities in JSON format,\n"
+			"    only meaningful for vhost-user mode\n");
 	}
 
 	fprintf(f,
@@ -1290,6 +1296,10 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"map-host-loopback", required_argument, NULL,		21 },
 		{"map-guest-addr", required_argument,	NULL,		22 },
 		{"dns-host",	required_argument,	NULL,		24 },
+		{"vhost-user",	no_argument,		NULL,		25 },
+		/* vhost-user backend program convention */
+		{"print-capabilities", no_argument,	NULL,		26 },
+		{"socket-path",	required_argument,	NULL,		's' },
 		{ 0 },
 	};
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1478,6 +1488,15 @@ void conf(struct ctx *c, int argc, char **argv)
 				break;
 
 			die("Invalid host nameserver address: %s", optarg);
+		case 25:
+			if (c->mode == MODE_PASTA) {
+				err("--vhost-user is for passt mode only");
+				usage(argv[0], stdout, EXIT_SUCCESS);
+			}
+			c->mode = MODE_VU;
+			break;
+		case 26:
+			vu_print_capabilities();
 			break;
 		case 'd':
 			c->debug = 1;
diff --git a/epoll_type.h b/epoll_type.h
index 0ad1efa0ccec..f3ef41584757 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -36,6 +36,10 @@ enum epoll_type {
 	EPOLL_TYPE_TAP_PASST,
 	/* socket listening for qemu socket connections */
 	EPOLL_TYPE_TAP_LISTEN,
+	/* vhost-user command socket */
+	EPOLL_TYPE_VHOST_CMD,
+	/* vhost-user kick event socket */
+	EPOLL_TYPE_VHOST_KICK,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/iov.c b/iov.c
index 3f9e229a305f..3741db21790f 100644
--- a/iov.c
+++ b/iov.c
@@ -68,7 +68,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
  *
  * Returns:    The number of bytes successfully copied.
  */
-/* cppcheck-suppress unusedFunction */
 size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
 		    size_t offset, const void *buf, size_t bytes)
 {
diff --git a/isolation.c b/isolation.c
index 45fba1e68b9d..c2a3c7b7911d 100644
--- a/isolation.c
+++ b/isolation.c
@@ -379,12 +379,19 @@ void isolate_postfork(const struct ctx *c)
 
 	prctl(PR_SET_DUMPABLE, 0);
 
-	if (c->mode == MODE_PASTA) {
-		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
-		prog.filter = filter_pasta;
-	} else {
+	switch (c->mode) {
+	case MODE_PASST:
 		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
 		prog.filter = filter_passt;
+		break;
+	case MODE_PASTA:
+		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
+		prog.filter = filter_pasta;
+		break;
+	case MODE_VU:
+		prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
+		prog.filter = filter_vu;
+		break;
 	}
 
 	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
diff --git a/packet.c b/packet.c
index 37489961a37e..e5a78d079231 100644
--- a/packet.c
+++ b/packet.c
@@ -36,6 +36,17 @@
 static int packet_check_range(const struct pool *p, size_t offset, size_t len,
 			      const char *start, const char *func, int line)
 {
+	if (p->buf_size == 0) {
+		int ret;
+
+		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
+
+		if (ret == -1)
+			trace("cannot find region, %s:%i", func, line);
+
+		return ret;
+	}
+
 	if (start < p->buf) {
 		trace("packet start %p before buffer start %p, "
 		      "%s:%i", (void *)start, (void *)p->buf, func, line);
diff --git a/packet.h b/packet.h
index 8377dcf678bb..3f70e949c066 100644
--- a/packet.h
+++ b/packet.h
@@ -8,8 +8,10 @@
 
 /**
  * struct pool - Generic pool of packets stored in a buffer
- * @buf:	Buffer storing packet descriptors
- * @buf_size:	Total size of buffer
+ * @buf:	Buffer storing packet descriptors,
+ * 		a struct vu_dev_region array for passt vhost-user mode
+ * @buf_size:	Total size of buffer,
+ * 		0 for passt vhost-user mode
  * @size:	Number of usable descriptors for the pool
  * @count:	Number of used descriptors for the pool
  * @pkt:	Descriptors: see macros below
@@ -22,6 +24,8 @@ struct pool {
 	struct iovec pkt[1];
 };
 
+int vu_packet_check_range(void *buf, size_t offset, size_t len,
+			  const char *start);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/passt.1 b/passt.1
index ef33267e9cd7..96532dd39aa2 100644
--- a/passt.1
+++ b/passt.1
@@ -397,12 +397,20 @@ interface address are configured on a given host interface.
 .SS \fBpasst\fR-only options
 
 .TP
-.BR \-s ", " \-\-socket " " \fIpath
+.BR \-s ", " \-\-socket-path ", " \-\-socket " " \fIpath
 Path for UNIX domain socket used by \fBqemu\fR(1) or \fBqrap\fR(1) to connect to
 \fBpasst\fR.
 Default is to probe a free socket, not accepting connections, starting from
 \fI/tmp/passt_1.socket\fR to \fI/tmp/passt_64.socket\fR.
 
+.TP
+.BR \-\-vhost-user
+Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
+
+.TP
+.BR \-\-print-capabilities
+Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
+
 .TP
 .BR \-F ", " \-\-fd " " \fIFD
 Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
diff --git a/passt.c b/passt.c
index 79093ee02d62..2d105e81218d 100644
--- a/passt.c
+++ b/passt.c
@@ -52,6 +52,7 @@
 #include "arch.h"
 #include "log.h"
 #include "tcp_splice.h"
+#include "vu_common.h"
 
 #define EPOLL_EVENTS		8
 
@@ -74,6 +75,8 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_PASTA]		= "/dev/net/tun device",
 	[EPOLL_TYPE_TAP_PASST]		= "connected qemu socket",
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
+	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
+	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -360,6 +363,12 @@ loop:
 		case EPOLL_TYPE_PING:
 			icmp_sock_handler(&c, ref);
 			break;
+		case EPOLL_TYPE_VHOST_CMD:
+			vu_control_handler(c.vdev, c.fd_tap, eventmask);
+			break;
+		case EPOLL_TYPE_VHOST_KICK:
+			vu_kick_cb(c.vdev, ref, &now);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index 4908ed937dc8..311482d36257 100644
--- a/passt.h
+++ b/passt.h
@@ -25,6 +25,8 @@ union epoll_ref;
 #include "fwd.h"
 #include "tcp.h"
 #include "udp.h"
+#include "udp_vu.h"
+#include "vhost_user.h"
 
 /* Default address for our end on the tap interface.  Bit 0 of byte 0 must be 0
  * (unicast) and bit 1 of byte 1 must be 1 (locally administered).  Otherwise
@@ -94,6 +96,7 @@ struct fqdn {
 enum passt_modes {
 	MODE_PASST,
 	MODE_PASTA,
+	MODE_VU,
 };
 
 /**
@@ -228,6 +231,7 @@ struct ip6_ctx {
  * @freebind:		Allow binding of non-local addresses for forwarding
  * @low_wmem:		Low probed net.core.wmem_max
  * @low_rmem:		Low probed net.core.rmem_max
+ * @vdev:		vhost-user device
  */
 struct ctx {
 	enum passt_modes mode;
@@ -289,6 +293,8 @@ struct ctx {
 
 	int low_wmem;
 	int low_rmem;
+
+	struct vu_dev *vdev;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/pcap.c b/pcap.c
index 6ee6cdfd261a..718d6ad61732 100644
--- a/pcap.c
+++ b/pcap.c
@@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
  * @iovcnt:	Number of buffers (@iov entries)
  * @offset:	Offset of the L2 frame within the full data length
  */
-/* cppcheck-suppress unusedFunction */
 void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
 {
 	struct timespec now;
diff --git a/tap.c b/tap.c
index 4b826fdf7adc..22d19f1833f7 100644
--- a/tap.c
+++ b/tap.c
@@ -58,6 +58,8 @@
 #include "packet.h"
 #include "tap.h"
 #include "log.h"
+#include "vhost_user.h"
+#include "vu_common.h"
 
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
@@ -78,16 +80,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 	struct iovec iov[2];
 	size_t iovcnt = 0;
 
-	if (c->mode == MODE_PASST) {
+	switch (c->mode) {
+	case MODE_PASST:
 		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
 		iovcnt++;
-	}
-
-	iov[iovcnt].iov_base = (void *)data;
-	iov[iovcnt].iov_len = l2len;
-	iovcnt++;
+		/* fall through */
+	case MODE_PASTA:
+		iov[iovcnt].iov_base = (void *)data;
+		iov[iovcnt].iov_len = l2len;
+		iovcnt++;
 
-	tap_send_frames(c, iov, iovcnt, 1);
+		tap_send_frames(c, iov, iovcnt, 1);
+		break;
+	case MODE_VU:
+		vu_send_single(c, data, l2len);
+		break;
+	}
 }
 
 /**
@@ -414,10 +422,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 	if (!nframes)
 		return 0;
 
-	if (c->mode == MODE_PASTA)
+	switch (c->mode) {
+	case MODE_PASTA:
 		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
-	else
+		break;
+	case MODE_PASST:
 		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
+		break;
+	case MODE_VU:
+		/* fall through */
+	default:
+		ASSERT(0);
+	}
 
 	if (m < nframes)
 		debug("tap: failed to send %zu frames of %zu",
@@ -976,7 +992,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
  * @c:		Execution context
  */
-static void tap_sock_reset(struct ctx *c)
+void tap_sock_reset(struct ctx *c)
 {
 	info("Client connection closed%s", c->one_off ? ", exiting" : "");
 
@@ -987,6 +1003,8 @@ static void tap_sock_reset(struct ctx *c)
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
 	close(c->fd_tap);
 	c->fd_tap = -1;
+	if (c->mode == MODE_VU)
+		vu_cleanup(c->vdev);
 }
 
 /**
@@ -1205,6 +1223,11 @@ static void tap_backend_show_hints(struct ctx *c)
 		info("or qrap, for earlier qemu versions:");
 		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
 		break;
+	case MODE_VU:
+		info("You can start qemu with:");
+		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
+		     c->sock_path);
+		break;
 	}
 }
 
@@ -1232,8 +1255,8 @@ static void tap_sock_unix_init(const struct ctx *c)
  */
 void tap_listen_handler(struct ctx *c, uint32_t events)
 {
-	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
 	struct epoll_event ev = { 0 };
+	union epoll_ref ref = { 0 };
 	int v = INT_MAX / 2;
 	struct ucred ucred;
 	socklen_t len;
@@ -1273,6 +1296,10 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
+	if (c->mode == MODE_VU)
+		ref.type = EPOLL_TYPE_VHOST_CMD;
+	else
+		ref.type = EPOLL_TYPE_TAP_PASST;
 	ev.events = EPOLLIN | EPOLLRDHUP;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
@@ -1339,7 +1366,7 @@ static void tap_sock_tun_init(struct ctx *c)
  * @base:	Buffer base
  * @size	Buffer size
  */
-static void tap_sock_update_pool(void *base, size_t size)
+void tap_sock_update_pool(void *base, size_t size)
 {
 	int i;
 
@@ -1353,13 +1380,15 @@ static void tap_sock_update_pool(void *base, size_t size)
 }
 
 /**
- * tap_backend_init() - Create and set up AF_UNIX socket or
- *			tuntap file descriptor
+ * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
  * @c:		Execution context
  */
 void tap_backend_init(struct ctx *c)
 {
-	tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
+	if (c->mode == MODE_VU)
+		tap_sock_update_pool(NULL, 0);
+	else
+		tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
 
 	if (c->fd_tap != -1) { /* Passed as --fd */
 		struct epoll_event ev = { 0 };
@@ -1367,10 +1396,17 @@ void tap_backend_init(struct ctx *c)
 
 		ASSERT(c->one_off);
 		ref.fd = c->fd_tap;
-		if (c->mode == MODE_PASST)
+		switch (c->mode) {
+		case MODE_PASST:
 			ref.type = EPOLL_TYPE_TAP_PASST;
-		else
+			break;
+		case MODE_PASTA:
 			ref.type = EPOLL_TYPE_TAP_PASTA;
+			break;
+		case MODE_VU:
+			ref.type = EPOLL_TYPE_VHOST_CMD;
+			break;
+		}
 
 		ev.events = EPOLLIN | EPOLLRDHUP;
 		ev.data.u64 = ref.u64;
@@ -1378,9 +1414,14 @@ void tap_backend_init(struct ctx *c)
 		return;
 	}
 
-	if (c->mode == MODE_PASTA) {
+	switch (c->mode) {
+	case MODE_PASTA:
 		tap_sock_tun_init(c);
-	} else {
+		break;
+	case MODE_VU:
+		vu_init(c);
+		/* fall through */
+	case MODE_PASST:
 		tap_sock_unix_init(c);
 
 		/* In passt mode, we don't know the guest's MAC address until it
@@ -1388,6 +1429,7 @@ void tap_backend_init(struct ctx *c)
 		 * first packets will reach it.
 		 */
 		memset(&c->guest_mac, 0xff, sizeof(c->guest_mac));
+		break;
 	}
 
 	tap_backend_show_hints(c);
diff --git a/tap.h b/tap.h
index 8728cc5c09c3..dfbd8b9ebd72 100644
--- a/tap.h
+++ b/tap.h
@@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
  */
 static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 {
-	thdr->vnet_len = htonl(l2len);
+	if (thdr)
+		thdr->vnet_len = htonl(l2len);
 }
 
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
@@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 int tap_sock_unix_open(char *sock_path);
+void tap_sock_reset(struct ctx *c);
+void tap_sock_update_pool(void *base, size_t size);
 void tap_backend_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
diff --git a/tcp.c b/tcp.c
index eae02b1647e3..fd2def0d8a39 100644
--- a/tcp.c
+++ b/tcp.c
@@ -304,6 +304,7 @@
 #include "flow_table.h"
 #include "tcp_internal.h"
 #include "tcp_buf.h"
+#include "tcp_vu.h"
 
 /* MSS rounding: see SET_MSS() */
 #define MSS_DEFAULT			536
@@ -1328,6 +1329,9 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
 static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
 			 int flags)
 {
+	if (c->mode == MODE_VU)
+		return tcp_vu_send_flag(c, conn, flags);
+
 	return tcp_buf_send_flag(c, conn, flags);
 }
 
@@ -1721,6 +1725,9 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
  */
 static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 {
+	if (c->mode == MODE_VU)
+		return tcp_vu_data_from_sock(c, conn);
+
 	return tcp_buf_data_from_sock(c, conn);
 }
 
diff --git a/tcp_vu.c b/tcp_vu.c
new file mode 100644
index 000000000000..1126fb39d138
--- /dev/null
+++ b/tcp_vu.c
@@ -0,0 +1,476 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* tcp_vu.c - TCP L2 vhost-user management functions
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#include <errno.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <netinet/ip.h>
+
+#include <sys/socket.h>
+
+#include <linux/tcp.h>
+#include <linux/virtio_net.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "siphash.h"
+#include "inany.h"
+#include "vhost_user.h"
+#include "tcp.h"
+#include "pcap.h"
+#include "flow.h"
+#include "tcp_conn.h"
+#include "flow_table.h"
+#include "tcp_vu.h"
+#include "tap.h"
+#include "tcp_internal.h"
+#include "checksum.h"
+#include "vu_common.h"
+
+static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1];
+static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+
+/**
+ * tcp_vu_l2_hdrlen() - return the size of the header in level 2 frame (TDP)
+ * @v6:		Set for IPv6 packet
+ *
+ * Return: Return the size of the header
+ */
+static size_t tcp_vu_l2_hdrlen(bool v6)
+{
+	size_t l2_hdrlen;
+
+	l2_hdrlen = sizeof(struct ethhdr) + sizeof(struct tcphdr);
+
+	if (v6)
+		l2_hdrlen += sizeof(struct ipv6hdr);
+	else
+		l2_hdrlen += sizeof(struct iphdr);
+
+	return l2_hdrlen;
+}
+
+/**
+ * tcp_vu_update_check() - Calculate TCP checksum
+ * @tapside:	Address information for one side of the flow
+ * @iov:	Pointer to the array of IO vectors
+ * @iov_used:	Length of the array
+ */
+static void tcp_vu_update_check(const struct flowside *tapside,
+			        struct iovec *iov, int iov_used)
+{
+	char *base = iov[0].iov_base;
+
+	if (inany_v4(&tapside->oaddr)) {
+		const struct iphdr *iph = vu_ip(base);
+
+		tcp_update_check_tcp4(iph, iov, iov_used,
+				      (char *)vu_payloadv4(base) - base);
+	} else {
+		const struct ipv6hdr *ip6h = vu_ip(base);
+
+		tcp_update_check_tcp6(ip6h, iov, iov_used,
+				      (char *)vu_payloadv6(base) - base);
+	}
+}
+
+/**
+ * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @flags:	TCP flags: if not set, send segment only if ACK is due
+ *
+ * Return: negative error code on connection reset, 0 otherwise
+ */
+int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	const struct flowside *tapside = TAPFLOW(conn);
+	size_t l2len, l4len, optlen, hdrlen;
+	struct ethhdr *eh;
+	int elem_cnt;
+	int nb_ack;
+	int ret;
+
+	hdrlen = tcp_vu_l2_hdrlen(CONN_V6(conn));
+
+	vu_init_elem(elem, iov_vu, 2);
+
+	elem_cnt = vu_collect_one_frame(vdev, vq, elem, 1,
+					hdrlen + OPT_MSS_LEN + OPT_WS_LEN + 1,
+					0, NULL);
+	if (elem_cnt < 1)
+		return 0;
+
+	vu_set_vnethdr(vdev, &iov_vu[0], 1, 0);
+
+	eh = vu_eth(iov_vu[0].iov_base);
+
+	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
+
+	if (CONN_V4(conn)) {
+		struct tcp_payload_t *payload;
+		struct iphdr *iph;
+		uint32_t seq;
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		iph = vu_ip(iov_vu[0].iov_base);
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+
+		payload = vu_payloadv4(iov_vu[0].iov_base);
+		memset(&payload->th, 0, sizeof(payload->th));
+		payload->th.doff = offsetof(struct tcp_flags_t, opts) / 4;
+		payload->th.ack = 1;
+
+		seq = conn->seq_to_tap;
+		ret = tcp_prepare_flags(c, conn, flags, &payload->th,
+					(char *)payload->data, &optlen);
+		if (ret <= 0) {
+			vu_queue_rewind(vq, 1);
+			return ret;
+		}
+
+		l4len = tcp_fill_headers4(conn, NULL, iph, payload, optlen,
+					  NULL, seq, true);
+		l2len = sizeof(*iph);
+	} else {
+		struct tcp_payload_t *payload;
+		struct ipv6hdr *ip6h;
+		uint32_t seq;
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		ip6h = vu_ip(iov_vu[0].iov_base);
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+
+		payload = vu_payloadv6(iov_vu[0].iov_base);
+		memset(&payload->th, 0, sizeof(payload->th));
+		payload->th.doff = offsetof(struct tcp_flags_t, opts) / 4;
+		payload->th.ack = 1;
+
+		seq = conn->seq_to_tap;
+		ret = tcp_prepare_flags(c, conn, flags, &payload->th,
+					(char *)payload->data, &optlen);
+		if (ret <= 0) {
+			vu_queue_rewind(vq, 1);
+			return ret;
+		}
+
+		l4len = tcp_fill_headers6(conn, NULL, ip6h, payload, optlen,
+					  seq, true);
+		l2len = sizeof(*ip6h);
+	}
+	l2len += l4len + sizeof(struct ethhdr);
+
+	elem[0].in_sg[0].iov_len = l2len +
+				   sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	if (*c->pcap) {
+		tcp_vu_update_check(tapside, &elem[0].in_sg[0], 1);
+		pcap_iov(&elem[0].in_sg[0], 1,
+			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
+	}
+	nb_ack = 1;
+
+	if (flags & DUP_ACK) {
+		elem_cnt = vu_collect_one_frame(vdev, vq, &elem[1], 1, l2len,
+						0, NULL);
+		if (elem_cnt == 1) {
+			memcpy(elem[1].in_sg[0].iov_base,
+			       elem[0].in_sg[0].iov_base, l2len);
+			vu_set_vnethdr(vdev, &elem[1].in_sg[0], 1, 0);
+			nb_ack++;
+
+			if (*c->pcap)
+				pcap_iov(&elem[1].in_sg[0], 1, 0);
+		}
+	}
+
+	vu_flush(vdev, vq, elem, nb_ack);
+
+	return 0;
+}
+
+/** tcp_vu_sock_recv() - Receive datastream from socket into vhost-user buffers
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @v4:		Set for IPv4 connections
+ * @fillsize:	Number of bytes we can receive
+ * @datalen:	Size of received data (output)
+ *
+ * Return: Number of iov entries used to store the data
+ */
+static ssize_t tcp_vu_sock_recv(const struct ctx *c,
+				struct tcp_tap_conn *conn, bool v4,
+				size_t fillsize, ssize_t *dlen)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	struct msghdr mh_sock = { 0 };
+	uint16_t mss = MSS_GET(conn);
+	int s = conn->sock;
+	size_t l2_hdrlen;
+	int elem_cnt;
+	ssize_t ret;
+
+	*dlen = 0;
+
+	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
+
+	vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
+
+	elem_cnt = vu_collect(vdev, vq, elem, VIRTQUEUE_MAX_SIZE, mss,
+			      l2_hdrlen, fillsize);
+	if (elem_cnt < 0) {
+		tcp_rst(c, conn);
+		return -ENOMEM;
+	}
+
+	mh_sock.msg_iov = iov_vu;
+	mh_sock.msg_iovlen = elem_cnt + 1;
+
+	do
+		ret = recvmsg(s, &mh_sock, MSG_PEEK);
+	while (ret < 0 && errno == EINTR);
+
+	if (ret < 0) {
+		vu_queue_rewind(vq, elem_cnt);
+		if (errno != EAGAIN && errno != EWOULDBLOCK) {
+			ret = -errno;
+			tcp_rst(c, conn);
+		}
+		return ret;
+	}
+	if (!ret) {
+		vu_queue_rewind(vq, elem_cnt);
+
+		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
+			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
+			if (retf) {
+				tcp_rst(c, conn);
+				return retf;
+			}
+
+			conn_event(c, conn, TAP_FIN_SENT);
+		}
+		return 0;
+	}
+
+	*dlen = ret;
+
+	return elem_cnt;
+}
+
+/**
+ * tcp_vu_prepare() - Prepare the packet header
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @first:	Pointer to the array of IO vectors
+ * @dlen:	Packet data length
+ * @check:	Checksum, if already known
+ */
+static void tcp_vu_prepare(const struct ctx *c,
+			   struct tcp_tap_conn *conn, struct iovec *first,
+			   size_t dlen, const uint16_t **check)
+{
+	const struct flowside *toside = TAPFLOW(conn);
+	char *base = first->iov_base;
+	struct ethhdr *eh;
+
+	/* we guess the first iovec provided by the guest can embed
+	 * all the headers needed by L2 frame
+	 */
+
+	eh = vu_eth(base);
+
+	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
+
+	/* initialize header */
+	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
+		struct tcp_payload_t *payload;
+		struct iphdr *iph;
+
+		ASSERT(first[0].iov_len >= sizeof(struct virtio_net_hdr_mrg_rxbuf) +
+		       sizeof(struct ethhdr) + sizeof(struct iphdr) +
+		       sizeof(struct tcphdr));
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		iph = vu_ip(base);
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+		payload = vu_payloadv4(base);
+		memset(&payload->th, 0, sizeof(payload->th));
+		payload->th.doff = offsetof(struct tcp_payload_t, data) / 4;
+		payload->th.ack = 1;
+
+		tcp_fill_headers4(conn, NULL, iph, payload, dlen,
+				  *check, conn->seq_to_tap, true);
+		*check = &iph->check;
+	} else {
+		struct tcp_payload_t *payload;
+		struct ipv6hdr *ip6h;
+
+		ASSERT(first[0].iov_len >= sizeof(struct virtio_net_hdr_mrg_rxbuf) +
+		       sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
+		       sizeof(struct tcphdr));
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		ip6h = vu_ip(base);
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+
+		payload = vu_payloadv6(base);
+		memset(&payload->th, 0, sizeof(payload->th));
+		payload->th.doff = offsetof(struct tcp_payload_t, data) / 4;
+		payload->th.ack = 1;
+
+		tcp_fill_headers6(conn, NULL, ip6h, payload, dlen,
+				  conn->seq_to_tap, true);
+	}
+}
+
+/**
+ * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
+ *			     in window
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ *
+ * Return: Negative on connection reset, 0 otherwise
+ */
+int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
+{
+	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	const struct flowside *tapside = TAPFLOW(conn);
+	uint16_t mss = MSS_GET(conn);
+	size_t l2_hdrlen, fillsize;
+	int i, iov_cnt, iov_used;
+	int v4 = CONN_V4(conn);
+	uint32_t already_sent = 0;
+	const uint16_t *check;
+	struct iovec *first;
+	int frame_size;
+	int num_buffers;
+	ssize_t len;
+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		flow_err(conn,
+			 "Got packet, but RX virtqueue not usable yet");
+		return 0;
+	}
+
+	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
+
+	if (SEQ_LT(already_sent, 0)) {
+		/* RFC 761, section 2.1. */
+		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
+			   conn->seq_ack_from_tap, conn->seq_to_tap);
+		conn->seq_to_tap = conn->seq_ack_from_tap;
+		already_sent = 0;
+	}
+
+	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, STALLED);
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
+		return 0;
+	}
+
+	/* Set up buffer descriptors we'll fill completely and partially. */
+
+	fillsize = wnd_scaled;
+
+	if (peek_offset_cap)
+		already_sent = 0;
+
+	iov_vu[0].iov_base = tcp_buf_discard;
+	iov_vu[0].iov_len = already_sent;
+	fillsize -= already_sent;
+
+	/* collect the buffers from vhost-user and fill them with the
+	 * data from the socket
+	 */
+	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
+	if (iov_cnt <= 0)
+		return iov_cnt;
+
+	len -= already_sent;
+	if (len <= 0) {
+		conn_flag(c, conn, STALLED);
+		vu_queue_rewind(vq, iov_cnt);
+		return 0;
+	}
+
+	conn_flag(c, conn, ~STALLED);
+
+	/* Likely, some new data was acked too. */
+	tcp_update_seqack_wnd(c, conn, 0, NULL);
+
+	/* initialize headers */
+	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
+	iov_used = 0;
+	num_buffers = 0;
+	check = NULL;
+	frame_size = 0;
+
+	/* iov_vu is an array of buffers and the buffer size can be
+	 * smaller than the frame size we want to use but with
+	 * num_buffer we can merge several virtio iov buffers in one packet
+	 * we need only to set the packet headers in the first iov and
+	 * num_buffer to the number of iov entries
+	 */
+	for (i = 0; i < iov_cnt && len; i++) {
+
+		if (frame_size == 0)
+			first = &iov_vu[i + 1];
+
+		if (iov_vu[i + 1].iov_len > (size_t)len)
+			iov_vu[i + 1].iov_len = len;
+
+		len -= iov_vu[i + 1].iov_len;
+		iov_used++;
+
+		frame_size += iov_vu[i + 1].iov_len;
+		num_buffers++;
+
+		if (frame_size >= mss || len == 0 ||
+		    i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
+			if (i + 1 == iov_cnt)
+				check = NULL;
+
+			/* restore first iovec base: point to vnet header */
+			vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen);
+
+			tcp_vu_prepare(c, conn, first, frame_size, &check);
+			if (*c->pcap)  {
+				tcp_vu_update_check(tapside, first, num_buffers);
+				pcap_iov(first, num_buffers,
+					 sizeof(struct virtio_net_hdr_mrg_rxbuf));
+			}
+
+			conn->seq_to_tap += frame_size;
+
+			frame_size = 0;
+			num_buffers = 0;
+		}
+	}
+
+	/* release unused buffers */
+	vu_queue_rewind(vq, iov_cnt - iov_used);
+
+	/* send packets */
+	vu_flush(vdev, vq, elem, iov_used);
+
+	conn_flag(c, conn, ACK_FROM_TAP_DUE);
+
+	return 0;
+}
diff --git a/tcp_vu.h b/tcp_vu.h
new file mode 100644
index 000000000000..6ab6057f352a
--- /dev/null
+++ b/tcp_vu.h
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef TCP_VU_H
+#define TCP_VU_H
+
+int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
+int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
+
+#endif  /*TCP_VU_H */
diff --git a/udp.c b/udp.c
index 8fc5d8099310..1171d9d1a75b 100644
--- a/udp.c
+++ b/udp.c
@@ -628,6 +628,11 @@ void udp_listen_sock_handler(const struct ctx *c,
 			     union epoll_ref ref, uint32_t events,
 			     const struct timespec *now)
 {
+	if (c->mode == MODE_VU) {
+		udp_vu_listen_sock_handler(c, ref, events, now);
+		return;
+	}
+
 	udp_buf_listen_sock_handler(c, ref, events, now);
 }
 
@@ -697,6 +702,11 @@ static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 			    uint32_t events, const struct timespec *now)
 {
+	if (c->mode == MODE_VU) {
+		udp_vu_reply_sock_handler(c, ref, events, now);
+		return;
+	}
+
 	udp_buf_reply_sock_handler(c, ref, events, now);
 }
 
diff --git a/udp_vu.c b/udp_vu.c
new file mode 100644
index 000000000000..3cb76945c9c1
--- /dev/null
+++ b/udp_vu.c
@@ -0,0 +1,336 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* udp_vu.c - UDP L2 vhost-user management functions
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#include <unistd.h>
+#include <assert.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/udp.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <sys/uio.h>
+#include <linux/virtio_net.h>
+
+#include "checksum.h"
+#include "util.h"
+#include "ip.h"
+#include "siphash.h"
+#include "inany.h"
+#include "passt.h"
+#include "pcap.h"
+#include "log.h"
+#include "vhost_user.h"
+#include "udp_internal.h"
+#include "flow.h"
+#include "flow_table.h"
+#include "udp_flow.h"
+#include "udp_vu.h"
+#include "vu_common.h"
+
+static struct iovec     iov_vu		[VIRTQUEUE_MAX_SIZE];
+static struct vu_virtq_element	elem		[VIRTQUEUE_MAX_SIZE];
+
+/**
+ * udp_vu_l2_hdrlen() - return the size of the header in level 2 frame (UDP)
+ * @v6:		Set for IPv6 packet
+ *
+ * Return: Return the size of the header
+ */
+static size_t udp_vu_l2_hdrlen(bool v6)
+{
+	size_t l2_hdrlen;
+
+	l2_hdrlen = sizeof(struct ethhdr) + sizeof(struct udphdr);
+
+	if (v6)
+		l2_hdrlen += sizeof(struct ipv6hdr);
+	else
+		l2_hdrlen += sizeof(struct iphdr);
+
+	return l2_hdrlen;
+}
+
+static int udp_vu_sock_init(int s, union sockaddr_inany *s_in)
+{
+	struct msghdr msg = {
+		.msg_name = s_in,
+		.msg_namelen = sizeof(union sockaddr_inany),
+	};
+
+	return recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
+}
+
+/**
+ * udp_vu_sock_recv() - Receive datagrams from socket into vhost-user buffers
+ * @c:		Execution context
+ * @s:		Socket to receive from
+ * @events:	epoll events bitmap
+ * @v6:		Set for IPv6 connections
+ * @datalen:	Size of received data (output)
+ *
+ * Return: Number of iov entries used to store the datagram
+ */
+static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events,
+			    bool v6, ssize_t *dlen)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	int max_elem, iov_cnt, idx, iov_used;
+	struct msghdr msg  = { 0 };
+	size_t off, l2_hdrlen;
+
+	ASSERT(!c->no_udp);
+
+	if (!(events & EPOLLIN))
+		return 0;
+
+	/* compute L2 header length */
+
+	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+		max_elem = VIRTQUEUE_MAX_SIZE;
+	else
+		max_elem = 1;
+
+	l2_hdrlen = udp_vu_l2_hdrlen(v6);
+
+	vu_init_elem(elem, iov_vu, max_elem);
+
+	iov_cnt = vu_collect_one_frame(vdev, vq, elem, max_elem,
+			      ETH_MAX_MTU - l2_hdrlen,
+			      l2_hdrlen, NULL);
+	if (iov_cnt == 0)
+		return 0;
+
+	msg.msg_iov = iov_vu;
+	msg.msg_iovlen = iov_cnt;
+
+	*dlen = recvmsg(s, &msg, 0);
+	if (*dlen < 0) {
+		vu_queue_rewind(vq, iov_cnt);
+		return 0;
+	}
+
+	/* count the numbers of buffer filled by recvmsg() */
+	idx = iov_skip_bytes(iov_vu, iov_cnt, *dlen, &off);
+
+	/* adjust last iov length */
+	if (idx < iov_cnt)
+		iov_vu[idx].iov_len = off;
+	iov_used = idx + !!off;
+
+	/* we have at least the header */
+	if (iov_used == 0)
+		iov_used = 1;
+
+	/* release unused buffers */
+	vu_queue_rewind(vq, iov_cnt - iov_used);
+
+	vu_set_vnethdr(vdev, &iov_vu[0], iov_used, l2_hdrlen);
+
+	return iov_used;
+}
+
+/**
+ * udp_vu_prepare() - Prepare the packet header
+ * @c:		Execution context
+ * @toside:	Address information for one side of the flow
+ * @datalen:	Packet data length
+ *
+ * Return: Layer-4 length
+ */
+static size_t udp_vu_prepare(const struct ctx *c,
+			     const struct flowside *toside, ssize_t dlen)
+{
+	struct ethhdr *eh;
+	size_t l4len;
+
+	/* ethernet header */
+	eh = vu_eth(iov_vu[0].iov_base);
+
+	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
+
+	/* initialize header */
+	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
+		struct iphdr *iph = vu_ip(iov_vu[0].iov_base);
+		struct udp_payload_t *bp = vu_payloadv4(iov_vu[0].iov_base);
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
+
+		l4len = udp_update_hdr4(iph, bp, toside, dlen, true);
+	} else {
+		struct ipv6hdr *ip6h = vu_ip(iov_vu[0].iov_base);
+		struct udp_payload_t *bp = vu_payloadv6(iov_vu[0].iov_base);
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
+
+		l4len = udp_update_hdr6(ip6h, bp, toside, dlen, true);
+	}
+
+	return l4len;
+}
+
+/**
+ * udp_vu_csum() - Calculate and set checksum for a UDP packet
+ * @toside:	ddress information for one side of the flow
+ * @l4len:	IPv4 Payload length
+ * @iov_used:	Length of the array
+ */
+static void udp_vu_csum(const struct flowside *toside, int iov_used)
+{
+	const struct in_addr *src4 = inany_v4(&toside->oaddr);
+	const struct in_addr *dst4 = inany_v4(&toside->eaddr);
+	char *base = iov_vu[0].iov_base;
+	struct udp_payload_t *bp;
+
+	if (src4 && dst4) {
+		bp = vu_payloadv4(base);
+		csum_udp4(&bp->uh, *src4, *dst4, iov_vu, iov_used,
+			  (char *)&bp->data - base);
+	} else {
+		bp = vu_payloadv6(base);
+		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6,
+			  iov_vu, iov_used, (char *)&bp->data - base);
+	}
+}
+
+/**
+ * udp_vu_listen_sock_handler() - Handle new data from socket
+ * @c:		Execution context
+ * @ref:	epoll reference
+ * @events:	epoll events bitmap
+ * @now:	Current timestamp
+ */
+void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	int i;
+
+	if (udp_sock_errs(c, ref.fd, events) < 0) {
+		err("UDP: Unrecoverable error on listening socket:"
+		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
+		return;
+	}
+
+	for (i = 0; i < UDP_MAX_FRAMES; i++) {
+		const struct flowside *toside;
+		union sockaddr_inany s_in;
+		flow_sidx_t batchsidx;
+		uint8_t batchpif;
+		ssize_t dlen;
+		int iov_used;
+		bool v6;
+
+		if (udp_vu_sock_init(ref.fd, &s_in) < 0)
+			break;
+
+		batchsidx = udp_flow_from_sock(c, ref, &s_in, now);
+		batchpif = pif_at_sidx(batchsidx);
+
+		if (batchpif != PIF_TAP) {
+			if (flow_sidx_valid(batchsidx)) {
+				flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
+				struct udp_flow *uflow = udp_at_sidx(batchsidx);
+
+				flow_err(uflow,
+					"No support for forwarding UDP from %s to %s",
+					pif_name(pif_at_sidx(fromsidx)),
+					pif_name(batchpif));
+			} else {
+				debug("Discarding 1 datagram without flow");
+			}
+
+			continue;
+		}
+
+		toside = flowside_at_sidx(batchsidx);
+
+		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
+
+		iov_used = udp_vu_sock_recv(c, ref.fd, events, v6, &dlen);
+		if (iov_used <= 0)
+			break;
+
+		udp_vu_prepare(c, toside, dlen);
+		if (*c->pcap) {
+			udp_vu_csum(toside, iov_used);
+			pcap_iov(iov_vu, iov_used,
+				 sizeof(struct virtio_net_hdr_mrg_rxbuf));
+		}
+		vu_flush(vdev, vq, elem, iov_used);
+	}
+}
+
+/**
+ * udp_vu_reply_sock_handler() - Handle new data from flow specific socket
+ * @c:		Execution context
+ * @ref:	epoll reference
+ * @events:	epoll events bitmap
+ * @now:	Current timestamp
+ */
+void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+			        uint32_t events, const struct timespec *now)
+{
+	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
+	const struct flowside *toside = flowside_at_sidx(tosidx);
+	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
+	int from_s = uflow->s[ref.flowside.sidei];
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	int i;
+
+	ASSERT(!c->no_udp);
+
+	if (udp_sock_errs(c, from_s, events) < 0) {
+		flow_err(uflow, "Unrecoverable error on reply socket");
+		flow_err_details(uflow);
+		udp_flow_close(c, uflow);
+		return;
+	}
+
+	for (i = 0; i < UDP_MAX_FRAMES; i++) {
+		uint8_t topif = pif_at_sidx(tosidx);
+		ssize_t dlen;
+		int iov_used;
+		bool v6;
+
+		ASSERT(uflow);
+
+		if (topif != PIF_TAP) {
+			uint8_t frompif = pif_at_sidx(ref.flowside);
+
+			flow_err(uflow,
+				 "No support for forwarding UDP from %s to %s",
+				 pif_name(frompif), pif_name(topif));
+			continue;
+		}
+
+		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
+
+		iov_used = udp_vu_sock_recv(c, from_s, events, v6, &dlen);
+		if (iov_used <= 0)
+			break;
+		flow_trace(uflow, "Received 1 datagram on reply socket");
+		uflow->ts = now->tv_sec;
+
+		udp_vu_prepare(c, toside, dlen);
+		if (*c->pcap) {
+			udp_vu_csum(toside, iov_used);
+			pcap_iov(iov_vu, iov_used,
+				 sizeof(struct virtio_net_hdr_mrg_rxbuf));
+		}
+		vu_flush(vdev, vq, elem, iov_used);
+	}
+}
diff --git a/udp_vu.h b/udp_vu.h
new file mode 100644
index 000000000000..ba7018d3bf01
--- /dev/null
+++ b/udp_vu.h
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef UDP_VU_H
+#define UDP_VU_H
+
+void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now);
+void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+			       uint32_t events, const struct timespec *now);
+#endif /* UDP_VU_H */
diff --git a/vhost_user.c b/vhost_user.c
index 1e302926b8fe..e905f3329f71 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -48,12 +48,13 @@
 /* vhost-user version we are compatible with */
 #define VHOST_USER_VERSION 1
 
+static struct vu_dev vdev_storage;
+
 /**
  * vu_print_capabilities() - print vhost-user capabilities
  * 			     this is part of the vhost-user backend
  * 			     convention.
  */
-/* cppcheck-suppress unusedFunction */
 void vu_print_capabilities(void)
 {
 	info("{");
@@ -163,9 +164,7 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
  */
 static void vu_remove_watch(const struct vu_dev *vdev, int fd)
 {
-	/* Placeholder to add passt related code */
-	(void)vdev;
-	(void)fd;
+	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
 }
 
 /**
@@ -487,6 +486,14 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
 		}
 	}
 
+	/* As vu_packet_check_range() has no access to the number of
+	 * memory regions, mark the end of the array with mmap_addr = 0
+	 */
+	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
+	vdev->regions[vdev->nregions].mmap_addr = 0;
+
+	tap_sock_update_pool(vdev->regions, 0);
+
 	return false;
 }
 
@@ -615,9 +622,12 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
  */
 static void vu_set_watch(const struct vu_dev *vdev, int fd)
 {
-	/* Placeholder to add passt related code */
-	(void)vdev;
-	(void)fd;
+	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
+	struct epoll_event ev = { 0 };
+
+	ev.data.u64 = ref.u64;
+	ev.events = EPOLLIN;
+	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, fd, &ev);
 }
 
 /**
@@ -829,14 +839,14 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
  * @c:		execution context
  * @vdev:	vhost-user device
  */
-/* cppcheck-suppress unusedFunction */
-void vu_init(struct ctx *c, struct vu_dev *vdev)
+void vu_init(struct ctx *c)
 {
 	int i;
 
-	vdev->context = c;
+	c->vdev = &vdev_storage;
+	c->vdev->context = c;
 	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
-		vdev->vq[i] = (struct vu_virtq){
+		c->vdev->vq[i] = (struct vu_virtq){
 			.call_fd = -1,
 			.kick_fd = -1,
 			.err_fd = -1,
@@ -849,7 +859,6 @@ void vu_init(struct ctx *c, struct vu_dev *vdev)
  * vu_cleanup() - Reset vhost-user device
  * @vdev:	vhost-user device
  */
-/* cppcheck-suppress unusedFunction */
 void vu_cleanup(struct vu_dev *vdev)
 {
 	unsigned int i;
@@ -896,8 +905,7 @@ void vu_cleanup(struct vu_dev *vdev)
  */
 static void vu_sock_reset(struct vu_dev *vdev)
 {
-	/* Placeholder to add passt related code */
-	(void)vdev;
+	tap_sock_reset(vdev->context);
 }
 
 static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
@@ -925,7 +933,6 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
  * @fd:		vhost-user message socket
  * @events:	epoll events
  */
-/* cppcheck-suppress unusedFunction */
 void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 {
 	struct vhost_user_msg msg = { 0 };
diff --git a/vhost_user.h b/vhost_user.h
index 5af349ba58b8..464ba21e962f 100644
--- a/vhost_user.h
+++ b/vhost_user.h
@@ -183,7 +183,6 @@ struct vhost_user_msg {
  *
  * Return: true if the virqueue is enabled, false otherwise
  */
-/* cppcheck-suppress unusedFunction */
 static inline bool vu_queue_enabled(const struct vu_virtq *vq)
 {
 	return vq->enable;
@@ -195,14 +194,13 @@ static inline bool vu_queue_enabled(const struct vu_virtq *vq)
  *
  * Return: true if the virqueue is started, false otherwise
  */
-/* cppcheck-suppress unusedFunction */
 static inline bool vu_queue_started(const struct vu_virtq *vq)
 {
 	return vq->started;
 }
 
 void vu_print_capabilities(void);
-void vu_init(struct ctx *c, struct vu_dev *vdev);
+void vu_init(struct ctx *c);
 void vu_cleanup(struct vu_dev *vdev);
 void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events);
 #endif /* VHOST_USER_H */
diff --git a/virtio.c b/virtio.c
index 380590afbca3..0598ff479858 100644
--- a/virtio.c
+++ b/virtio.c
@@ -328,7 +328,6 @@ static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
  * @dev:	Vhost-user device
  * @vq:		Virtqueue
  */
-/* cppcheck-suppress unusedFunction */
 void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
 {
 	if (!vq->vring.avail)
@@ -504,7 +503,6 @@ static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned i
  *
  * Return: -1 if there is an error, 0 otherwise
  */
-/* cppcheck-suppress unusedFunction */
 int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
 {
 	unsigned int head;
@@ -565,7 +563,6 @@ void vu_queue_unpop(struct vu_virtq *vq)
  * @vq:		Virtqueue
  * @num:	Number of element to unpop
  */
-/* cppcheck-suppress unusedFunction */
 bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
 {
 	if (num > vq->inuse)
@@ -621,7 +618,6 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
  * @len:	Size of the element
  * @idx:	Used ring entry index
  */
-/* cppcheck-suppress unusedFunction */
 void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
 		   unsigned int len, unsigned int idx)
 {
@@ -645,7 +641,6 @@ static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
  * @vq:		Virtqueue
  * @count:	Number of entry to flush
  */
-/* cppcheck-suppress unusedFunction */
 void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
 {
 	uint16_t old, new;
diff --git a/vu_common.c b/vu_common.c
new file mode 100644
index 000000000000..4977d6af0f92
--- /dev/null
+++ b/vu_common.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * common_vu.c - vhost-user common UDP and TCP functions
+ */
+
+#include <unistd.h>
+#include <sys/uio.h>
+#include <sys/eventfd.h>
+#include <linux/virtio_net.h>
+
+#include "util.h"
+#include "passt.h"
+#include "tap.h"
+#include "vhost_user.h"
+#include "pcap.h"
+#include "vu_common.h"
+
+/**
+ * vu_packet_check_range() - Check if a given memory zone is contained in
+ * 			     a mapped guest memory region
+ * @buf:	Array of the available memory regions
+ * @offset:	Offset of data range in packet descriptor
+ * @size:	Length of desired data range
+ * @start:	Start of the packet descriptor
+ *
+ * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ */
+int vu_packet_check_range(void *buf, size_t offset, size_t len,
+			  const char *start)
+{
+	struct vu_dev_region *dev_region;
+
+	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
+		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+		char *m = (char *)dev_region->mmap_addr;
+
+		if (m <= start &&
+		    start + offset + len <= m + dev_region->mmap_offset +
+					       dev_region->size)
+			return 0;
+	}
+
+	return -1;
+}
+
+/**
+ * vu_init_elem() - initialize an array of virtqueue element with 1 iov in each
+ * @elem:	Array of virtqueue element to initialize
+ * @iov:	Array of iovec to assign to virtqueue element
+ * @elem_cnt:	Number of virtqueue element
+ */
+void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov, int elem_cnt)
+{
+	int i;
+
+	for (i = 0; i < elem_cnt; i++) {
+		elem[i].out_num = 0;
+		elem[i].out_sg = NULL;
+		elem[i].in_num = 1;
+		elem[i].in_sg = &iov[i];
+	}
+}
+
+/**
+ * vu_collect_one_frame() - collect virtio buffers from a given virtqueue for
+ *			    one frame
+ * @vdev:		vhost-user device
+ * @vq:			virtqueue to collect from
+ * @elem:		Array of virtqueue element
+ * 			each element must be initialized with one iovec entry
+ * 			in the in_sg array.
+ * @max_elem:		Number of virtqueue element in the array
+ * @size:		Maximum size of the data in the frame
+ * @hdrlen:		Size of the frame header
+ */
+int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq,
+			 struct vu_virtq_element *elem, int max_elem,
+			 size_t size, size_t hdrlen, size_t *frame_size)
+{
+	size_t current_size = 0;
+	struct iovec *iov;
+	int elem_cnt = 0;
+	int ret;
+
+	/* header is at least virtio_net_hdr_mrg_rxbuf */
+	hdrlen += sizeof(struct virtio_net_hdr_mrg_rxbuf);
+
+	/* collect first (or unique) element, it will contain header */
+	ret = vu_queue_pop(vdev, vq, &elem[0]);
+	if (ret < 0)
+		goto out;
+
+	if (elem[0].in_num < 1) {
+		warn("virtio-net receive queue contains no in buffers");
+		vu_queue_detach_element(vq);
+		goto out;
+	}
+
+	iov = &elem[elem_cnt].in_sg[0];
+
+	ASSERT(iov->iov_len >= hdrlen);
+
+	/* add space for header */
+	iov->iov_base = (char *)iov->iov_base + hdrlen;
+	iov->iov_len -= hdrlen;
+
+	if (iov->iov_len > size)
+		iov->iov_len = size;
+
+	elem_cnt++;
+	current_size = iov->iov_len;
+
+	if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+		goto out;
+
+	/* if possible, coalesce buffers to reach size */
+	while (current_size < size && elem_cnt < max_elem) {
+
+		ret = vu_queue_pop(vdev, vq, &elem[elem_cnt]);
+		if (ret < 0)
+			break;
+
+		if (elem[elem_cnt].in_num < 1) {
+			warn("virtio-net receive queue contains no in buffers");
+			vu_queue_detach_element(vq);
+			break;
+		}
+
+		iov = &elem[elem_cnt].in_sg[0];
+
+		if (iov->iov_len > size - current_size)
+			iov->iov_len = size - current_size;
+
+		current_size += iov->iov_len;
+		elem_cnt++;
+	}
+
+out:
+	if (frame_size)
+		*frame_size = current_size;
+
+	return elem_cnt;
+}
+
+/**
+ * vu_collect() - collect virtio buffers from a given virtqueue
+ * @vdev:		vhost-user device
+ * @vq:			virtqueue to collect from
+ * @elem:		Array of virtqueue element
+ * 			each element must be initialized with one iovec entry
+ * 			in the in_sg array.
+ * @max_elem:		Number of virtqueue element in the array
+ * @max_frame_size:	Maximum size of the data in the frame
+ * @hdrlen:		Size of the frame header
+ * @size:		Total size of the buffers we need to collect
+ * 			(if size > max_frame_size, we collect several frame)
+ */
+int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
+	       struct vu_virtq_element *elem, int max_elem,
+	       size_t max_frame_size, size_t hdrlen, size_t size)
+{
+	int elem_cnt = 0;
+
+	while (size > 0 && elem_cnt < max_elem) {
+		size_t frame_size;
+		int cnt;
+
+		if (max_frame_size > size)
+			max_frame_size = size;
+
+		cnt = vu_collect_one_frame(vdev, vq,
+					   &elem[elem_cnt], max_elem - elem_cnt,
+					   max_frame_size, hdrlen, &frame_size);
+		if (cnt == 0)
+			break;
+
+		size -= frame_size;
+		elem_cnt += cnt;
+
+		if (frame_size < max_frame_size)
+			break;
+	}
+
+	return elem_cnt;
+}
+
+/**
+ * vu_set_vnethdr() - set virtio-net headers in a given iovec
+ * @vdev:		vhost-user device
+ * @iov:		One iovec to initialize
+ * @num_buffers:	Number of guest buffers of the frame
+ * @hdrlen:		Size of the frame header
+ */
+void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov,
+		    int num_buffers, size_t hdrlen)
+{
+	struct virtio_net_hdr_mrg_rxbuf *vnethdr;
+
+	/* header is at least virtio_net_hdr_mrg_rxbuf */
+	hdrlen += sizeof(struct virtio_net_hdr_mrg_rxbuf);
+
+	/* NOLINTNEXTLINE(clang-analyzer-core.UndefinedBinaryOperatorResult) */
+	iov->iov_base = (char *)iov->iov_base - hdrlen;
+	iov->iov_len += hdrlen;
+
+	vnethdr = iov->iov_base;
+	vnethdr->hdr = VU_HEADER;
+	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+		vnethdr->num_buffers = htole16(num_buffers);
+}
+
+/**
+ * vu_flush() - flush all the collected buffers to the vhost-user interface
+ * @vdev:	vhost-user device
+ * @vq:		vhost-user virtqueue
+ * @elem:	virtqueue element array to send back to the virqueue
+ * @iov_used:	Length of the array
+ */
+void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
+	      struct vu_virtq_element *elem, int elem_cnt)
+{
+	int i;
+
+	for (i = 0; i < elem_cnt; i++)
+		vu_queue_fill(vq, &elem[i], elem[i].in_sg[0].iov_len, i);
+
+	vu_queue_flush(vq, elem_cnt);
+	vu_queue_notify(vdev, vq);
+}
+
+/**
+ * vu_handle_tx() - Receive data from the TX virtqueue
+ * @vdev:	vhost-user device
+ * @index:	index of the virtqueue
+ * @now:	Current timestamp
+ */
+static void vu_handle_tx(struct vu_dev *vdev, int index,
+			 const struct timespec *now)
+{
+	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+	struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
+	struct vu_virtq *vq = &vdev->vq[index];
+	int hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	int out_sg_count;
+	int count;
+
+	if (!VHOST_USER_IS_QUEUE_TX(index)) {
+		debug("vhost-user: index %d is not a TX queue", index);
+		return;
+	}
+
+	tap_flush_pools();
+
+	count = 0;
+	out_sg_count = 0;
+	while (count < VIRTQUEUE_MAX_SIZE) {
+		int ret;
+
+		elem[count].out_num = 1;
+		elem[count].out_sg = &out_sg[out_sg_count];
+		elem[count].in_num = 0;
+		elem[count].in_sg = NULL;
+		ret = vu_queue_pop(vdev, vq, &elem[count]);
+		if (ret < 0)
+			break;
+		out_sg_count += elem[count].out_num;
+
+		if (elem[count].out_num < 1) {
+			debug("virtio-net header not in first element");
+			break;
+		}
+		ASSERT(elem[count].out_num == 1);
+
+		tap_add_packet(vdev->context,
+			       elem[count].out_sg[0].iov_len - hdrlen,
+			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
+		count++;
+	}
+	tap_handler(vdev->context, now);
+
+	if (count) {
+		int i;
+
+		for (i = 0; i < count; i++)
+			vu_queue_fill(vq, &elem[i], 0, i);
+		vu_queue_flush(vq, count);
+		vu_queue_notify(vdev, vq);
+	}
+}
+
+/**
+ * vu_kick_cb() - Called on a kick event to start to receive data
+ * @vdev:	vhost-user device
+ * @ref:	epoll reference information
+ * @now:	Current timestamp
+ */
+void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
+		const struct timespec *now)
+{
+	eventfd_t kick_data;
+	ssize_t rc;
+	int idx;
+
+	for (idx = 0; idx < VHOST_USER_MAX_QUEUES; idx++) {
+		if (vdev->vq[idx].kick_fd == ref.fd)
+			break;
+	}
+
+	if (idx == VHOST_USER_MAX_QUEUES)
+		return;
+
+	rc = eventfd_read(ref.fd, &kick_data);
+	if (rc == -1)
+		die_perror("vhost-user kick eventfd_read()");
+
+	debug("vhost-user: ot kick_data: %016"PRIx64" idx:%d",
+	      kick_data, idx);
+	if (VHOST_USER_IS_QUEUE_TX(idx))
+		vu_handle_tx(vdev, idx, now);
+}
+
+/**
+ * vu_send_single() - Send a buffer to the front-end using the RX virtqueue
+ * @c:		execution context
+ * @buf:	address of the buffer
+ * @size:	size of the buffer
+ *
+ * Return: number of bytes sent, -1 if there is an error
+ */
+int vu_send_single(const struct ctx *c, const void *buf, size_t size)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+	struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
+	size_t total;
+	int elem_cnt, max_elem;
+	int i;
+
+	debug("vu_send_single size %zu", size);
+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		err("Got packet, but no available descriptors on RX virtq.");
+		return 0;
+	}
+
+	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+		max_elem = VIRTQUEUE_MAX_SIZE;
+	else
+		max_elem = 1;
+
+	vu_init_elem(elem, in_sg, max_elem);
+
+	elem_cnt = vu_collect_one_frame(vdev, vq, elem, max_elem, size,
+					0, &total);
+	if (total < size) {
+		debug("vu_send_single: no space to send the data "
+		      "elem_cnt %d size %zd", elem_cnt, total);
+		goto err;
+	}
+
+	vu_set_vnethdr(vdev, in_sg, elem_cnt, 0);
+
+	/* copy data from the buffer to the iovec */
+	iov_from_buf(in_sg, elem_cnt, sizeof(struct virtio_net_hdr_mrg_rxbuf),
+		     buf, size);
+
+	if (*c->pcap) {
+		pcap_iov(in_sg, elem_cnt,
+			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
+	}
+
+	vu_flush(vdev, vq, elem, elem_cnt);
+
+	debug("vhost-user sent %zu", total);
+
+	return total;
+err:
+	for (i = 0; i < elem_cnt; i++)
+		vu_queue_detach_element(vq);
+
+	return 0;
+}
diff --git a/vu_common.h b/vu_common.h
new file mode 100644
index 000000000000..1d6048060059
--- /dev/null
+++ b/vu_common.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * vhost-user common UDP and TCP functions
+ */
+
+#ifndef VU_COMMON_H
+#define VU_COMMON_H
+#include <linux/virtio_net.h>
+
+static inline void *vu_eth(void *base)
+{
+	return ((char *)base + sizeof(struct virtio_net_hdr_mrg_rxbuf));
+}
+
+static inline void *vu_ip(void *base)
+{
+	return (struct ethhdr *)vu_eth(base) + 1;
+}
+
+static inline void *vu_payloadv4(void *base)
+{
+	return (struct iphdr *)vu_ip(base) + 1;
+}
+
+static inline void *vu_payloadv6(void *base)
+{
+	return (struct ipv6hdr *)vu_ip(base) + 1;
+}
+
+void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov,
+		  int elem_cnt);
+int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq,
+			 struct vu_virtq_element *elem, int max_elem,
+			 size_t size, size_t hdrlen, size_t *frame_size);
+int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
+	       struct vu_virtq_element *elem, int max_elem,
+	       size_t max_frame_size, size_t hdrlen, size_t size);
+void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov,
+                    int num_buffers, size_t hdrlen);
+void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
+	      struct vu_virtq_element *elem, int elem_cnt);
+void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
+		const struct timespec *now);
+int vu_send_single(const struct ctx *c, const void *buf, size_t size);
+#endif /* VU_COMMON_H */
-- 
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * vhost-user common UDP and TCP functions
+ */
+
+#ifndef VU_COMMON_H
+#define VU_COMMON_H
+#include <linux/virtio_net.h>
+
+static inline void *vu_eth(void *base)
+{
+	return ((char *)base + sizeof(struct virtio_net_hdr_mrg_rxbuf));
+}
+
+static inline void *vu_ip(void *base)
+{
+	return (struct ethhdr *)vu_eth(base) + 1;
+}
+
+static inline void *vu_payloadv4(void *base)
+{
+	return (struct iphdr *)vu_ip(base) + 1;
+}
+
+static inline void *vu_payloadv6(void *base)
+{
+	return (struct ipv6hdr *)vu_ip(base) + 1;
+}
+
+void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov,
+		  int elem_cnt);
+int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq,
+			 struct vu_virtq_element *elem, int max_elem,
+			 size_t size, size_t hdrlen, size_t *frame_size);
+int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
+	       struct vu_virtq_element *elem, int max_elem,
+	       size_t max_frame_size, size_t hdrlen, size_t size);
+void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov,
+                    int num_buffers, size_t hdrlen);
+void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
+	      struct vu_virtq_element *elem, int elem_cnt);
+void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
+		const struct timespec *now);
+int vu_send_single(const struct ctx *c, const void *buf, size_t size);
+#endif /* VU_COMMON_H */
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v8 8/8] test: Add tests for passt in vhost-user mode
  2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
                   ` (6 preceding siblings ...)
  2024-10-10 12:29 ` [PATCH v8 7/8] vhost-user: add vhost-user Laurent Vivier
@ 2024-10-10 12:29 ` Laurent Vivier
  2024-10-15  3:40   ` David Gibson
  2024-10-15 19:54   ` Stefano Brivio
  7 siblings, 2 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-10 12:29 UTC (permalink / raw)
  To: passt-dev; +Cc: Stefano Brivio, Laurent Vivier

From: Stefano Brivio <sbrivio@redhat.com>

Run functional and performance tests for vhost-user mode as well. For
functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
to their non-vhost-user counterparts, as no differences are intended
but we want to distinguish them in test logs.

For performance tests, instead, we add separate perf/passt_vu_tcp and
perf/passt_vu_udp files, as we need longer test duration, as well as
higher UDP sending bandwidths and larger TCP windows, to actually get
the highest throughput vhost-user mode offers.

For valgrind tests, vhost-user mode needs two extra system calls:
statx and readlink. Add them as EXTRA_SYSCALLS for the valgrind
target.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile               |   3 +-
 test/lib/perf_report   |  15 +++
 test/lib/setup         |  77 ++++++++++++---
 test/lib/setup_ugly    |   2 +-
 test/passt_vu          |   1 +
 test/passt_vu_in_ns    |   1 +
 test/perf/passt_vu_tcp | 211 +++++++++++++++++++++++++++++++++++++++++
 test/perf/passt_vu_udp | 159 +++++++++++++++++++++++++++++++
 test/run               |  25 +++++
 test/two_guests_vu     |   1 +
 10 files changed, 479 insertions(+), 16 deletions(-)
 create mode 120000 test/passt_vu
 create mode 120000 test/passt_vu_in_ns
 create mode 100644 test/perf/passt_vu_tcp
 create mode 100644 test/perf/passt_vu_udp
 create mode 120000 test/two_guests_vu

diff --git a/Makefile b/Makefile
index 1e8910dda1f4..ce8aa4302790 100644
--- a/Makefile
+++ b/Makefile
@@ -138,7 +138,8 @@ qrap: $(QRAP_SRCS) passt.h
 
 valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction	\
 			    rt_sigreturn getpid gettid kill clock_gettime mmap \
-			    mmap2 munmap open unlink gettimeofday futex
+			    mmap2 munmap open unlink gettimeofday futex statx \
+			    readlink
 valgrind: FLAGS += -g -DVALGRIND
 valgrind: all
 
diff --git a/test/lib/perf_report b/test/lib/perf_report
index d1ef50bfe0d5..c4ec817bcd1e 100755
--- a/test/lib/perf_report
+++ b/test/lib/perf_report
@@ -49,6 +49,21 @@ td:empty { visibility: hidden; }
 	__passt_tcp_LINE__ __passt_udp_LINE__
 </table>
 
+</li><li><p>passt with vhost-user support</p>
+<table class="passt" width="70%">
+	<tr>
+		<th/>
+		<th id="perf_passt_vu_tcp" colspan="__passt_vu_tcp_cols__">TCP, __passt_vu_tcp_threads__ at __passt_vu_tcp_freq__ GHz</th>
+		<th id="perf_passt_vu_udp" colspan="__passt_vu_udp_cols__">UDP, __passt_vu_udp_threads__ at __passt_vu_udp_freq__ GHz</th>
+	</tr>
+	<tr>
+		<td align="right">MTU:</td>
+		__passt_vu_tcp_header__
+		__passt_vu_udp_header__
+	</tr>
+	__passt_vu_tcp_LINE__ __passt_vu_udp_LINE__
+</table>
+
 <style type="text/CSS">
 table.pasta_local td { border: 0px solid; padding: 6px; line-height: 1; }
 table.pasta_local td { text-align: right; }
diff --git a/test/lib/setup b/test/lib/setup
index 5338393ce35c..3409bd29cd81 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -15,8 +15,7 @@
 
 INITRAMFS="${BASEPATH}/mbuto.img"
 VCPUS="$( [ $(nproc) -ge 8 ] && echo 6 || echo $(( $(nproc) / 2 + 1 )) )"
-__mem_kib="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
-VMEM="$((${__mem_kib} / 1024 / 4))"
+MEM_KIB="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
 QEMU_ARCH="$(uname -m)"
 [ "${QEMU_ARCH}" = "i686" ] && QEMU_ARCH=i386
 
@@ -46,6 +45,7 @@ setup_passt() {
 	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt.pcap"
 	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
 
 	context_run passt "make clean"
 	context_run passt "make valgrind"
@@ -54,16 +54,29 @@ setup_passt() {
 	# pidfile isn't created until passt is listening
 	wait_for [ -f "${STATESETUP}/passt.pid" ]
 
+	__vmem="$((${MEM_KIB} / 1024 / 4))"
+	if [ ${VHOST_USER} -eq 1 ]; then
+		__vmem="$(((${__vmem} + 500) / 1000))G"
+		__qemu_netdev="						       \
+			-chardev socket,id=c,path=${STATESETUP}/passt.socket   \
+			-netdev vhost-user,id=v,chardev=c		       \
+			-device virtio-net,netdev=v			       \
+			-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
+			-numa node,memdev=m"
+	else
+		__qemu_netdev="-device virtio-net-pci,netdev=s		       \
+			-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket"
+	fi
+
 	GUEST_CID=94557
 	context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
 		' -machine accel=kvm'                                      \
-		' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
+		' -m '${__vmem}' -cpu host -smp '${VCPUS}		   \
 		' -kernel '"${KERNEL}"					   \
 		' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
 		' -nodefaults'						   \
 		' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
-		' -device virtio-net-pci,netdev=s0 '			   \
-		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
+		" ${__qemu_netdev}"					   \
 		" -pidfile ${STATESETUP}/qemu.pid"			   \
 		" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
 
@@ -142,6 +155,7 @@ setup_passt_in_ns() {
 	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_in_pasta.pcap"
 	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
 
 	if [ ${VALGRIND} -eq 1 ]; then
 		context_run passt "make clean"
@@ -154,17 +168,30 @@ setup_passt_in_ns() {
 	fi
 	wait_for [ -f "${STATESETUP}/passt.pid" ]
 
+	__vmem="$((${MEM_KIB} / 1024 / 4))"
+	if [ ${VHOST_USER} -eq 1 ]; then
+		__vmem="$(((${__vmem} + 500) / 1000))G"
+		__qemu_netdev="						       \
+			-chardev socket,id=c,path=${STATESETUP}/passt.socket   \
+			-netdev vhost-user,id=v,chardev=c		       \
+			-device virtio-net,netdev=v			       \
+			-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
+			-numa node,memdev=m"
+	else
+		__qemu_netdev="-device virtio-net-pci,netdev=s		       \
+			-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket"
+	fi
+
 	GUEST_CID=94557
 	context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
 		' -machine accel=kvm'                                      \
 		' -M accel=kvm:tcg'                                        \
-		' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
+		' -m '${__vmem}' -cpu host -smp '${VCPUS}		   \
 		' -kernel '"${KERNEL}"					   \
 		' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
 		' -nodefaults'						   \
 		' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
-		' -device virtio-net-pci,netdev=s0 '			   \
-		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
+		" ${__qemu_netdev}"					   \
 		" -pidfile ${STATESETUP}/qemu.pid"			   \
 		" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
 
@@ -214,6 +241,7 @@ setup_two_guests() {
 	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
 	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
 
 	context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
 	wait_for [ -f "${STATESETUP}/passt_1.pid" ]
@@ -222,33 +250,54 @@ setup_two_guests() {
 	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
 	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
 
 	context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
 	wait_for [ -f "${STATESETUP}/passt_2.pid" ]
 
+	__vmem="$((${MEM_KIB} / 1024 / 4))"
+	if [ ${VHOST_USER} -eq 1 ]; then
+		__vmem="$(((${__vmem} + 500) / 1000))G"
+		__qemu_netdev1="					       \
+			-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
+			-netdev vhost-user,id=v,chardev=c		       \
+			-device virtio-net,netdev=v			       \
+			-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
+			-numa node,memdev=m"
+		__qemu_netdev1="					       \
+			-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
+			-netdev vhost-user,id=v,chardev=c		       \
+			-device virtio-net,netdev=v			       \
+			-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
+			-numa node,memdev=m"
+	else
+		__qemu_netdev1="-device virtio-net-pci,netdev=s		       \
+			-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_1.socket"
+		__qemu_netdev2="-device virtio-net-pci,netdev=s		       \
+			-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_2.socket"
+	fi
+
 	GUEST_1_CID=94557
 	context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}"		     \
 		' -M accel=kvm:tcg'                                          \
-		' -m '${VMEM}' -cpu host -smp '${VCPUS}                      \
+		' -m '${__vmem}' -cpu host -smp '${VCPUS}		     \
 		' -kernel '"${KERNEL}"					     \
 		' -initrd '${INITRAMFS}' -nographic -serial stdio'	     \
 		' -nodefaults'						     \
 		' -append "console=ttyS0 mitigations=off apparmor=0" '	     \
-		' -device virtio-net-pci,netdev=s0 '			     \
-		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_1.socket " \
+		" ${__qemu_netdev1}"					     \
 		" -pidfile ${STATESETUP}/qemu_1.pid"			     \
 		" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"
 
 	GUEST_2_CID=94558
 	context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}"		     \
 		' -M accel=kvm:tcg'                                          \
-		' -m '${VMEM}' -cpu host -smp '${VCPUS}                      \
+		' -m '${__vmem}' -cpu host -smp '${VCPUS}		     \
 		' -kernel '"${KERNEL}"					     \
 		' -initrd '${INITRAMFS}' -nographic -serial stdio'	     \
 		' -nodefaults'						     \
 		' -append "console=ttyS0 mitigations=off apparmor=0" '	     \
-		' -device virtio-net-pci,netdev=s0 '			     \
-		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_2.socket " \
+		" ${__qemu_netdev2}"					     \
 		" -pidfile ${STATESETUP}/qemu_2.pid"			     \
 		" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID"
 
diff --git a/test/lib/setup_ugly b/test/lib/setup_ugly
index 4b2a0774de1d..2802cc3bb43b 100755
--- a/test/lib/setup_ugly
+++ b/test/lib/setup_ugly
@@ -33,7 +33,7 @@ setup_memory() {
 
 	pane_or_context_run guest 'qemu-system-$(uname -m)'		   \
 		' -machine accel=kvm'                                      \
-		' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
+		' -m '$((${MEM_KIB} / 1024 / 4))' -cpu host -smp '${VCPUS}                    \
 		' -kernel ' "/boot/vmlinuz-$(uname -r)"			   \
 		' -initrd '${INITRAMFS_MEM}' -nographic -serial stdio'	   \
 		' -nodefaults'						   \
diff --git a/test/passt_vu b/test/passt_vu
new file mode 120000
index 000000000000..22f1840d1ad6
--- /dev/null
+++ b/test/passt_vu
@@ -0,0 +1 @@
+passt
\ No newline at end of file
diff --git a/test/passt_vu_in_ns b/test/passt_vu_in_ns
new file mode 120000
index 000000000000..3ff479e0436b
--- /dev/null
+++ b/test/passt_vu_in_ns
@@ -0,0 +1 @@
+passt_in_ns
\ No newline at end of file
diff --git a/test/perf/passt_vu_tcp b/test/perf/passt_vu_tcp
new file mode 100644
index 000000000000..b43400804e64
--- /dev/null
+++ b/test/perf/passt_vu_tcp
@@ -0,0 +1,211 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/perf/passt_vu_tcp - Check TCP performance in passt vhost-user mode
+#
+# Copyright (c) 2021 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+gtools	/sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr # From neper
+nstools	/sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr
+htools	bc head sed seq
+
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	passt: throughput and latency
+
+guest	/sbin/sysctl -w net.core.rmem_max=536870912
+guest	/sbin/sysctl -w net.core.wmem_max=536870912
+guest	/sbin/sysctl -w net.core.rmem_default=33554432
+guest	/sbin/sysctl -w net.core.wmem_default=33554432
+guest	/sbin/sysctl -w net.ipv4.tcp_rmem="4096 131072 268435456"
+guest	/sbin/sysctl -w net.ipv4.tcp_wmem="4096 131072 268435456"
+guest	/sbin/sysctl -w net.ipv4.tcp_timestamps=0
+
+ns	/sbin/sysctl -w net.ipv4.tcp_rmem="4096 524288 134217728"
+ns	/sbin/sysctl -w net.ipv4.tcp_wmem="4096 524288 134217728"
+ns	/sbin/sysctl -w net.ipv4.tcp_timestamps=0
+
+gout	IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+
+hout	FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1
+hout	FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
+hout	FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
+
+set	THREADS 4
+set	TIME 5
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -l 1M -O__OMIT__ -N
+
+info	Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
+report	passt_vu tcp __THREADS__ __FREQ__
+
+th	MTU 256B 576B 1280B 1500B 9000B 65520B
+
+
+tr	TCP throughput over IPv6: guest to host
+iperf3s	ns 10002
+
+bw	-
+bw	-
+guest	ip link set dev __IFNAME__ mtu 1280
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 16M
+bw	__BW__ 1.2 1.5
+guest	ip link set dev __IFNAME__ mtu 1500
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 1.6 1.8
+guest	ip link set dev __IFNAME__ mtu 9000
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M
+bw	__BW__ 4.0 5.0
+guest	ip link set dev __IFNAME__ mtu 65520
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M
+bw	__BW__ 7.0 8.0
+
+iperf3k	ns
+
+tl	TCP RR latency over IPv6: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	tcp_rr --nolog -6
+gout	LAT tcp_rr --nolog -l1 -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+tl	TCP CRR latency over IPv6: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	tcp_crr --nolog -6
+gout	LAT tcp_crr --nolog -l1 -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 500 400
+
+tr	TCP throughput over IPv4: guest to host
+iperf3s	ns 10002
+
+guest	ip link set dev __IFNAME__ mtu 256
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 2M
+bw	__BW__ 0.2 0.3
+guest	ip link set dev __IFNAME__ mtu 576
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 4M
+bw	__BW__ 0.5 0.8
+guest	ip link set dev __IFNAME__ mtu 1280
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 8M
+bw	__BW__ 1.2 1.5
+guest	ip link set dev __IFNAME__ mtu 1500
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 16M
+bw	__BW__ 1.6 1.8
+guest	ip link set dev __IFNAME__ mtu 9000
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M
+bw	__BW__ 4.0 5.0
+guest	ip link set dev __IFNAME__ mtu 65520
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M
+bw	__BW__ 7.0 8.0
+
+iperf3k	ns
+
+# Reducing MTU below 1280 deconfigures IPv6, get our address back
+guest	dhclient -6 -x
+guest	dhclient -6 __IFNAME__
+
+tl	TCP RR latency over IPv4: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	tcp_rr --nolog -4
+gout	LAT tcp_rr --nolog -l1 -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+tl	TCP CRR latency over IPv4: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	tcp_crr --nolog -4
+gout	LAT tcp_crr --nolog -l1 -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 500 400
+
+tr	TCP throughput over IPv6: host to guest
+iperf3s	guest 10001
+
+bw	-
+bw	-
+bw	-
+bw	-
+bw	-
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 6.0 6.8
+
+iperf3k	guest
+
+tl	TCP RR latency over IPv6: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	tcp_rr --nolog -P 10001 -C 10011 -6
+sleep	1
+nsout	LAT tcp_rr --nolog -l1 -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+tl	TCP CRR latency over IPv6: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	tcp_crr --nolog -P 10001 -C 10011 -6
+sleep	1
+nsout	LAT tcp_crr --nolog -l1 -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 500 350
+
+
+tr	TCP throughput over IPv4: host to guest
+iperf3s	guest 10001
+
+bw	-
+bw	-
+bw	-
+bw	-
+bw	-
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 6.0 6.8
+
+iperf3k	guest
+
+tl	TCP RR latency over IPv4: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	tcp_rr --nolog -P 10001 -C 10011 -4
+sleep	1
+nsout	LAT tcp_rr --nolog -l1 -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+tl	TCP CRR latency over IPv6: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	tcp_crr --nolog -P 10001 -C 10011 -4
+sleep	1
+nsout	LAT tcp_crr --nolog -l1 -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 500 300
+
+te
diff --git a/test/perf/passt_vu_udp b/test/perf/passt_vu_udp
new file mode 100644
index 000000000000..943ac11b4a51
--- /dev/null
+++ b/test/perf/passt_vu_udp
@@ -0,0 +1,159 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/perf/passt_vu_udp - Check UDP performance in passt vhost-user mode
+#
+# Copyright (c) 2021 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+gtools	/sbin/sysctl ip jq nproc sleep iperf3 udp_rr # From neper
+nstools	ip jq sleep iperf3 udp_rr
+htools	bc head sed
+
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	passt: throughput and latency
+
+guest	/sbin/sysctl -w net.core.rmem_max=16777216
+guest	/sbin/sysctl -w net.core.wmem_max=16777216
+guest	/sbin/sysctl -w net.core.rmem_default=16777216
+guest	/sbin/sysctl -w net.core.wmem_default=16777216
+
+hout	FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1
+hout	FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
+hout	FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
+
+set	THREADS 2
+set	TIME 1
+set	OPTS -u -P __THREADS__ --pacing-timer 1000
+
+info	Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
+
+report	passt_vu udp __THREADS__ __FREQ__
+
+th	pktlen 256B 576B 1280B 1500B 9000B 65520B
+
+tr	UDP throughput over IPv6: guest to host
+iperf3s	ns 10002
+# (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header
+
+bw	-
+bw	-
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 3G -l 1232
+bw	__BW__ 0.8 1.2
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 4G -l 1452
+bw	__BW__ 1.0 1.5
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 10G -l 8952
+bw	__BW__ 4.0 5.0
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 20G -l 64372
+bw	__BW__ 4.0 5.0
+
+iperf3k	ns
+
+tl	UDP RR latency over IPv6: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	udp_rr --nolog -6
+gout	LAT udp_rr --nolog -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+
+tr	UDP throughput over IPv4: guest to host
+iperf3s	ns 10002
+# (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header
+
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 1G -l 228
+bw	__BW__ 0.0 0.0
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 2G -l 548
+bw	__BW__ 0.4 0.6
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 3G -l 1252
+bw	__BW__ 0.8 1.2
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 4G -l 1472
+bw	__BW__ 1.0 1.5
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 10G -l 8972
+bw	__BW__ 4.0 5.0
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 20G -l 65492
+bw	__BW__ 4.0 5.0
+
+iperf3k	ns
+
+tl	UDP RR latency over IPv4: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	udp_rr --nolog -4
+gout	LAT udp_rr --nolog -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+
+tr	UDP throughput over IPv6: host to guest
+iperf3s	guest 10001
+# (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header
+
+bw	-
+bw	-
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 3G -l 1232
+bw	__BW__ 0.8 1.2
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 4G -l 1452
+bw	__BW__ 1.0 1.5
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 10G -l 8952
+bw	__BW__ 3.0 4.0
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 20G -l 64372
+bw	__BW__ 3.0 4.0
+
+iperf3k	guest
+
+tl	UDP RR latency over IPv6: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	udp_rr --nolog -P 10001 -C 10011 -6
+sleep	1
+nsout	LAT udp_rr --nolog -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+
+tr	UDP throughput over IPv4: host to guest
+iperf3s	guest 10001
+# (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header
+
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 1G -l 228
+bw	__BW__ 0.0 0.0
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 2G -l 548
+bw	__BW__ 0.4 0.6
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 3G -l 1252
+bw	__BW__ 0.8 1.2
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 4G -l 1472
+bw	__BW__ 1.0 1.5
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 10G -l 8972
+bw	__BW__ 3.0 4.0
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 20G -l 65492
+bw	__BW__ 3.0 4.0
+
+iperf3k	guest
+
+tl	UDP RR latency over IPv4: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	udp_rr --nolog -P 10001 -C 10011 -4
+sleep	1
+nsout	LAT udp_rr --nolog -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+te
diff --git a/test/run b/test/run
index 547a729b3fbe..f188d8eaf2e0 100755
--- a/test/run
+++ b/test/run
@@ -93,6 +93,7 @@ run() {
 	test memory/passt
 	teardown memory
 
+	VHOST_USER=0
 	setup passt
 	test passt/ndp
 	test passt/dhcp
@@ -115,7 +116,22 @@ run() {
 	test two_guests/basic
 	teardown two_guests
 
+	VHOST_USER=1
+	setup passt_in_ns
+	test passt_vu/ndp
+	test passt_vu_in_ns/dhcp
+	test passt_vu_in_ns/icmp
+	test passt_vu_in_ns/tcp
+	test passt_vu_in_ns/udp
+	test passt_vu_in_ns/shutdown
+	teardown passt_in_ns
+
+	setup two_guests
+	test two_guests_vu/basic
+	teardown two_guests
+
 	VALGRIND=0
+	VHOST_USER=0
 	setup passt_in_ns
 	test passt/ndp
 	test passt_in_ns/dhcp
@@ -126,6 +142,15 @@ run() {
 	test passt_in_ns/shutdown
 	teardown passt_in_ns
 
+	VHOST_USER=1
+	setup passt_in_ns
+	test passt_vu/ndp
+	test passt_vu_in_ns/dhcp
+	test perf/passt_vu_tcp
+	test perf/passt_vu_udp
+	test passt_vu_in_ns/shutdown
+	teardown passt_in_ns
+
 	# TODO: Make those faster by at least pre-installing gcc and make on
 	# non-x86 images, then re-enable.
 skip_distro() {
diff --git a/test/two_guests_vu b/test/two_guests_vu
new file mode 120000
index 000000000000..144b7cac5438
--- /dev/null
+++ b/test/two_guests_vu
@@ -0,0 +1 @@
+test/two_guests
\ No newline at end of file
-- 
@@ -0,0 +1 @@
+test/two_guests
\ No newline at end of file
-- 
2.46.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 4/8] udp: Prepare udp.c to be shared with vhost-user
  2024-10-10 12:28 ` [PATCH v8 4/8] udp: Prepare udp.c to be shared with vhost-user Laurent Vivier
@ 2024-10-14  4:29   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2024-10-14  4:29 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 7659 bytes --]

On Thu, Oct 10, 2024 at 02:28:58PM +0200, Laurent Vivier wrote:
> Export udp_payload_t, udp_update_hdr4(), udp_update_hdr6() and
> udp_sock_errs().
> 
> Rename udp_listen_sock_handler() to udp_buf_listen_sock_handler() and
> udp_reply_sock_handler to udp_buf_reply_sock_handler().
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  udp.c          | 74 ++++++++++++++++++++++++++++++--------------------
>  udp_internal.h | 34 +++++++++++++++++++++++
>  2 files changed, 79 insertions(+), 29 deletions(-)
>  create mode 100644 udp_internal.h
> 
> diff --git a/udp.c b/udp.c
> index 100610f2472e..8fc5d8099310 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -109,8 +109,7 @@
>  #include "pcap.h"
>  #include "log.h"
>  #include "flow_table.h"
> -
> -#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
> +#include "udp_internal.h"
>  
>  /* "Spliced" sockets indexed by bound port (host order) */
>  static int udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
> @@ -118,20 +117,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
>  
>  /* Static buffers */
>  
> -/**
> - * struct udp_payload_t - UDP header and data for inbound messages
> - * @uh:		UDP header
> - * @data:	UDP data
> - */
> -static struct udp_payload_t {
> -	struct udphdr uh;
> -	char data[USHRT_MAX - sizeof(struct udphdr)];
> -#ifdef __AVX2__
> -} __attribute__ ((packed, aligned(32)))
> -#else
> -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> -#endif
> -udp_payload[UDP_MAX_FRAMES];
> +/* UDP header and data for inbound messages */
> +static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
>  
>  /* Ethernet header for IPv4 frames */
>  static struct ethhdr udp4_eth_hdr;
> @@ -302,9 +289,9 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
>   *
>   * Return: size of IPv4 payload (UDP header + data)
>   */
> -static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
> -			      const struct flowside *toside, size_t dlen,
> -			      bool no_udp_csum)
> +size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
> +		       const struct flowside *toside, size_t dlen,
> +		       bool no_udp_csum)
>  {
>  	const struct in_addr *src = inany_v4(&toside->oaddr);
>  	const struct in_addr *dst = inany_v4(&toside->eaddr);
> @@ -345,9 +332,9 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
>   *
>   * Return: size of IPv6 payload (UDP header + data)
>   */
> -static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
> -			      const struct flowside *toside, size_t dlen,
> -			      bool no_udp_csum)
> +size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
> +		       const struct flowside *toside, size_t dlen,
> +		       bool no_udp_csum)
>  {
>  	uint16_t l4len = dlen + sizeof(bp->uh);
>  
> @@ -477,7 +464,7 @@ static int udp_sock_recverr(int s)
>   *
>   * Return: Number of errors handled, or < 0 if we have an unrecoverable error
>   */
> -static int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
> +int udp_sock_errs(const struct ctx *c, int s, uint32_t events)
>  {
>  	unsigned n_err = 0;
>  	socklen_t errlen;
> @@ -554,7 +541,7 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
>  }
>  
>  /**
> - * udp_listen_sock_handler() - Handle new data from socket
> + * udp_buf_listen_sock_handler() - Handle new data from socket
>   * @c:		Execution context
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
> @@ -562,8 +549,9 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
>   *
>   * #syscalls recvmmsg
>   */
> -void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			     uint32_t events, const struct timespec *now)
> +static void udp_buf_listen_sock_handler(const struct ctx *c,
> +					union epoll_ref ref, uint32_t events,
> +					const struct timespec *now)
>  {
>  	const socklen_t sasize = sizeof(udp_meta[0].s_in);
>  	int n, i;
> @@ -630,7 +618,21 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
>  }
>  
>  /**
> - * udp_reply_sock_handler() - Handle new data from flow specific socket
> + * udp_listen_sock_handler() - Handle new data from socket
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @events:	epoll events bitmap
> + * @now:	Current timestamp
> + */
> +void udp_listen_sock_handler(const struct ctx *c,
> +			     union epoll_ref ref, uint32_t events,
> +			     const struct timespec *now)
> +{
> +	udp_buf_listen_sock_handler(c, ref, events, now);
> +}
> +
> +/**
> + * udp_buf_reply_sock_handler() - Handle new data from flow specific socket
>   * @c:		Execution context
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
> @@ -638,8 +640,9 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
>   *
>   * #syscalls recvmmsg
>   */
> -void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			    uint32_t events, const struct timespec *now)
> +static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				       uint32_t events,
> +				       const struct timespec *now)
>  {
>  	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
>  	const struct flowside *toside = flowside_at_sidx(tosidx);
> @@ -684,6 +687,19 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>  	}
>  }
>  
> +/**
> + * udp_reply_sock_handler() - Handle new data from flow specific socket
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @events:	epoll events bitmap
> + * @now:	Current timestamp
> + */
> +void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			    uint32_t events, const struct timespec *now)
> +{
> +	udp_buf_reply_sock_handler(c, ref, events, now);
> +}
> +
>  /**
>   * udp_tap_handler() - Handle packets from tap
>   * @c:		Execution context
> diff --git a/udp_internal.h b/udp_internal.h
> new file mode 100644
> index 000000000000..cc80e3055423
> --- /dev/null
> +++ b/udp_internal.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright (c) 2021 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#ifndef UDP_INTERNAL_H
> +#define UDP_INTERNAL_H
> +
> +#include "tap.h" /* needed by udp_meta_t */
> +
> +#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
> +
> +/**
> + * struct udp_payload_t - UDP header and data for inbound messages
> + * @uh:		UDP header
> + * @data:	UDP data
> + */
> +struct udp_payload_t {
> +	struct udphdr uh;
> +	char data[USHRT_MAX - sizeof(struct udphdr)];
> +#ifdef __AVX2__
> +} __attribute__ ((packed, aligned(32)));
> +#else
> +} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> +#endif
> +
> +size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
> +		       const struct flowside *toside, size_t dlen,
> +		       bool no_udp_csum);
> +size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
> +                       const struct flowside *toside, size_t dlen,
> +		       bool no_udp_csum);
> +int udp_sock_errs(const struct ctx *c, int s, uint32_t events);
> +#endif /* UDP_INTERNAL_H */

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 5/8] tcp: Export headers functions
  2024-10-10 12:28 ` [PATCH v8 5/8] tcp: Export headers functions Laurent Vivier
@ 2024-10-14  4:29   ` David Gibson
  0 siblings, 0 replies; 50+ messages in thread
From: David Gibson @ 2024-10-14  4:29 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4409 bytes --]

On Thu, Oct 10, 2024 at 02:28:59PM +0200, Laurent Vivier wrote:
> Export tcp_fill_headers[4|6]() and tcp_update_check_tcp[4|6]().
> 
> They'll be needed by vhost-user.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp.c          | 30 +++++++++++++++---------------
>  tcp_internal.h | 15 +++++++++++++++
>  2 files changed, 30 insertions(+), 15 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 9617b7ac2404..eae02b1647e3 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -761,9 +761,9 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
>   * @iov_cnt:	Length of the array
>   * @l4offset:	IPv4 payload offset in the iovec array
>   */
> -static void tcp_update_check_tcp4(const struct iphdr *iph,
> -				  const struct iovec *iov, int iov_cnt,
> -				  size_t l4offset)
> +void tcp_update_check_tcp4(const struct iphdr *iph,
> +			   const struct iovec *iov, int iov_cnt,
> +			   size_t l4offset)
>  {
>  	uint16_t l4len = ntohs(iph->tot_len) - sizeof(struct iphdr);
>  	struct in_addr saddr = { .s_addr = iph->saddr };
> @@ -813,9 +813,9 @@ static void tcp_update_check_tcp4(const struct iphdr *iph,
>   * @iov_cnt:	Length of the array
>   * @l4offset:	IPv6 payload offset in the iovec array
>   */
> -static void tcp_update_check_tcp6(const struct ipv6hdr *ip6h,
> -				  const struct iovec *iov, int iov_cnt,
> -				  size_t l4offset)
> +void tcp_update_check_tcp6(const struct ipv6hdr *ip6h,
> +			   const struct iovec *iov, int iov_cnt,
> +			   size_t l4offset)
>  {
>  	uint16_t l4len = ntohs(ip6h->payload_len);
>  	size_t check_ofs;
> @@ -982,11 +982,11 @@ static void tcp_fill_header(struct tcphdr *th,
>   *
>   * Return: The IPv4 payload length, host order
>   */
> -static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
> -				struct tap_hdr *taph,
> -				struct iphdr *iph, struct tcp_payload_t *bp,
> -				size_t dlen, const uint16_t *check,
> -				uint32_t seq, bool no_tcp_csum)
> +size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
> +			 struct tap_hdr *taph,
> +			 struct iphdr *iph, struct tcp_payload_t *bp,
> +			 size_t dlen, const uint16_t *check,
> +			 uint32_t seq, bool no_tcp_csum)
>  {
>  	const struct flowside *tapside = TAPFLOW(conn);
>  	const struct in_addr *src4 = inany_v4(&tapside->oaddr);
> @@ -1034,10 +1034,10 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
>   *
>   * Return: The IPv6 payload length, host order
>   */
> -static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
> -				struct tap_hdr *taph,
> -				struct ipv6hdr *ip6h, struct tcp_payload_t *bp,
> -				size_t dlen, uint32_t seq, bool no_tcp_csum)
> +size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
> +			 struct tap_hdr *taph,
> +			 struct ipv6hdr *ip6h, struct tcp_payload_t *bp,
> +			 size_t dlen, uint32_t seq, bool no_tcp_csum)
>  {
>  	const struct flowside *tapside = TAPFLOW(conn);
>  	size_t l4len = dlen + sizeof(bp->th);
> diff --git a/tcp_internal.h b/tcp_internal.h
> index 2f74ffeff8f3..8e87f98b470f 100644
> --- a/tcp_internal.h
> +++ b/tcp_internal.h
> @@ -118,6 +118,21 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn);
>  		tcp_rst_do(c, conn);					\
>  	} while (0)
>  
> +void tcp_update_check_tcp4(const struct iphdr *iph,
> +			   const struct iovec *iov, int iov_cnt,
> +			   size_t l4offset);
> +void tcp_update_check_tcp6(const struct ipv6hdr *ip6h,
> +			   const struct iovec *iov, int iov_cnt,
> +			   size_t l4offset);
> +size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
> +			 struct tap_hdr *taph,
> +			 struct iphdr *iph, struct tcp_payload_t *bp,
> +			 size_t dlen, const uint16_t *check,
> +			 uint32_t seq, bool no_tcp_csum);
> +size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
> +			 struct tap_hdr *taph,
> +			 struct ipv6hdr *ip6h, struct tcp_payload_t *bp,
> +			 size_t dlen, uint32_t seq, bool no_tcp_csum);
>  size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
>  			       struct iovec *iov, size_t dlen,
>  			       const uint16_t *check, uint32_t seq,

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 6/8] passt: rename tap_sock_init() to tap_backend_init()
  2024-10-10 12:29 ` [PATCH v8 6/8] passt: rename tap_sock_init() to tap_backend_init() Laurent Vivier
@ 2024-10-14  4:30   ` David Gibson
  2024-10-14 22:38   ` Stefano Brivio
  1 sibling, 0 replies; 50+ messages in thread
From: David Gibson @ 2024-10-14  4:30 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4926 bytes --]

On Thu, Oct 10, 2024 at 02:29:00PM +0200, Laurent Vivier wrote:
> Extract pool storage initialization loop to tap_sock_update_pool(),
> extract QEMU hints to tap_backend_show_hints().
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  passt.c |  2 +-
>  tap.c   | 56 +++++++++++++++++++++++++++++++++++++++++---------------
>  tap.h   |  2 +-
>  3 files changed, 43 insertions(+), 17 deletions(-)
> 
> diff --git a/passt.c b/passt.c
> index ad6f0bc32df6..79093ee02d62 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -261,7 +261,7 @@ int main(int argc, char **argv)
>  
>  	pasta_netns_quit_init(&c);
>  
> -	tap_sock_init(&c);
> +	tap_backend_init(&c);
>  
>  	secret_init(&c);
>  
> diff --git a/tap.c b/tap.c
> index c53a39b79e62..4b826fdf7adc 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -1188,11 +1188,31 @@ int tap_sock_unix_open(char *sock_path)
>  	return fd;
>  }
>  
> +/**
> + * tap_backend_show_hints() - Give help information to start QEMU
> + * @c:		Execution context
> + */
> +static void tap_backend_show_hints(struct ctx *c)
> +{
> +	switch(c->mode) {
> +	case MODE_PASTA:
> +		/* No hints */
> +		break;
> +	case MODE_PASST:
> +		info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
> +		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> +		     c->sock_path);
> +		info("or qrap, for earlier qemu versions:");
> +		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +		break;
> +	}
> +}
> +
>  /**
>   * tap_sock_unix_init() - Start listening for connections on AF_UNIX socket
>   * @c:		Execution context
>   */
> -static void tap_sock_unix_init(struct ctx *c)
> +static void tap_sock_unix_init(const struct ctx *c)
>  {
>  	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
>  	struct epoll_event ev = { 0 };
> @@ -1203,12 +1223,6 @@ static void tap_sock_unix_init(struct ctx *c)
>  	ev.events = EPOLLIN | EPOLLET;
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
> -
> -	info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
> -	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> -	     c->sock_path);
> -	info("or qrap, for earlier qemu versions:");
> -	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
>  }
>  
>  /**
> @@ -1321,21 +1335,31 @@ static void tap_sock_tun_init(struct ctx *c)
>  }
>  
>  /**
> - * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
> - * @c:		Execution context
> + * tap_sock_update_pool() - Set the buffer base and size for the pool of packets
> + * @base:	Buffer base
> + * @size	Buffer size
>   */
> -void tap_sock_init(struct ctx *c)
> +static void tap_sock_update_pool(void *base, size_t size)
>  {
> -	size_t sz = sizeof(pkt_buf);
>  	int i;
>  
> -	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
> -	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
> +	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, base, size);
> +	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, base, size);
>  
>  	for (i = 0; i < TAP_SEQS; i++) {
> -		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
> -		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
> +		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
> +		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
>  	}
> +}
> +
> +/**
> + * tap_backend_init() - Create and set up AF_UNIX socket or
> + *			tuntap file descriptor
> + * @c:		Execution context
> + */
> +void tap_backend_init(struct ctx *c)
> +{
> +	tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
>  
>  	if (c->fd_tap != -1) { /* Passed as --fd */
>  		struct epoll_event ev = { 0 };
> @@ -1365,4 +1389,6 @@ void tap_sock_init(struct ctx *c)
>  		 */
>  		memset(&c->guest_mac, 0xff, sizeof(c->guest_mac));
>  	}
> +
> +	tap_backend_show_hints(c);
>  }
> diff --git a/tap.h b/tap.h
> index 85f1e8473711..8728cc5c09c3 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -68,7 +68,7 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  int tap_sock_unix_open(char *sock_path);
> -void tap_sock_init(struct ctx *c);
> +void tap_backend_init(struct ctx *c);
>  void tap_flush_pools(void);
>  void tap_handler(struct ctx *c, const struct timespec *now);
>  void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 6/8] passt: rename tap_sock_init() to tap_backend_init()
  2024-10-10 12:29 ` [PATCH v8 6/8] passt: rename tap_sock_init() to tap_backend_init() Laurent Vivier
  2024-10-14  4:30   ` David Gibson
@ 2024-10-14 22:38   ` Stefano Brivio
  1 sibling, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-14 22:38 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 10 Oct 2024 14:29:00 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> Extract pool storage initialization loop to tap_sock_update_pool(),
> extract QEMU hints to tap_backend_show_hints().
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  passt.c |  2 +-
>  tap.c   | 56 +++++++++++++++++++++++++++++++++++++++++---------------
>  tap.h   |  2 +-
>  3 files changed, 43 insertions(+), 17 deletions(-)
> 
> diff --git a/passt.c b/passt.c
> index ad6f0bc32df6..79093ee02d62 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -261,7 +261,7 @@ int main(int argc, char **argv)
>  
>  	pasta_netns_quit_init(&c);
>  
> -	tap_sock_init(&c);
> +	tap_backend_init(&c);
>  
>  	secret_init(&c);
>  
> diff --git a/tap.c b/tap.c
> index c53a39b79e62..4b826fdf7adc 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -1188,11 +1188,31 @@ int tap_sock_unix_open(char *sock_path)
>  	return fd;
>  }
>  
> +/**
> + * tap_backend_show_hints() - Give help information to start QEMU
> + * @c:		Execution context
> + */
> +static void tap_backend_show_hints(struct ctx *c)
> +{
> +	switch(c->mode) {

Nit: switch (c->mode) { ...

> +	case MODE_PASTA:
> +		/* No hints */
> +		break;
> +	case MODE_PASST:
> +		info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
> +		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> +		     c->sock_path);
> +		info("or qrap, for earlier qemu versions:");
> +		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +		break;
> +	}
> +}

The rest, up to this patch, looks good to me. I'm still reviewing 7/8.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-10 12:29 ` [PATCH v8 7/8] vhost-user: add vhost-user Laurent Vivier
@ 2024-10-15  3:23   ` David Gibson
  2024-10-16 10:07     ` Laurent Vivier
  2024-10-15 19:54   ` Stefano Brivio
  2024-10-17  0:10   ` Stefano Brivio
  2 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2024-10-15  3:23 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 71222 bytes --]

On Thu, Oct 10, 2024 at 02:29:01PM +0200, Laurent Vivier wrote:
> add virtio and vhost-user functions to connect with QEMU.
> 
>   $ ./passt --vhost-user
> 
> and
> 
>   # qemu-system-x86_64 ... -m 4G \
>         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>         -numa node,memdev=memfd0 \
>         -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>         -netdev vhost-user,id=netdev0,chardev=chr0 \
>         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>         ...
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile     |   6 +-
>  conf.c       |  21 ++-
>  epoll_type.h |   4 +
>  iov.c        |   1 -
>  isolation.c  |  15 +-
>  packet.c     |  11 ++
>  packet.h     |   8 +-
>  passt.1      |  10 +-
>  passt.c      |   9 +
>  passt.h      |   6 +
>  pcap.c       |   1 -
>  tap.c        |  80 +++++++--
>  tap.h        |   5 +-
>  tcp.c        |   7 +
>  tcp_vu.c     | 476 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_vu.h     |  12 ++
>  udp.c        |  10 ++
>  udp_vu.c     | 336 ++++++++++++++++++++++++++++++++++++
>  udp_vu.h     |  13 ++
>  vhost_user.c |  37 ++--
>  vhost_user.h |   4 +-
>  virtio.c     |   5 -
>  vu_common.c  | 385 +++++++++++++++++++++++++++++++++++++++++
>  vu_common.h  |  47 +++++
>  24 files changed, 1454 insertions(+), 55 deletions(-)
>  create mode 100644 tcp_vu.c
>  create mode 100644 tcp_vu.h
>  create mode 100644 udp_vu.c
>  create mode 100644 udp_vu.h
>  create mode 100644 vu_common.c
>  create mode 100644 vu_common.h
> 
> diff --git a/Makefile b/Makefile
> index 0e8ed60a0da1..1e8910dda1f4 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -54,7 +54,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
> +	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> +	vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -64,7 +65,8 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h vhost_user.h virtio.h
> +	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> +	virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/conf.c b/conf.c
> index c63101970155..29d6e41f5770 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -45,6 +45,7 @@
>  #include "lineread.h"
>  #include "isolation.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /**
>   * next_chunk - Return the next piece of a string delimited by a character
> @@ -762,9 +763,14 @@ static void usage(const char *name, FILE *f, int status)
>  			"    default: same interface name as external one\n");
>  	} else {
>  		fprintf(f,
> -			"  -s, --socket PATH	UNIX domain socket path\n"
> +			"  -s, --socket, --socket-path PATH	UNIX domain socket path\n"
>  			"    default: probe free path starting from "
>  			UNIX_SOCK_PATH "\n", 1);
> +		fprintf(f,
> +			"  --vhost-user		Enable vhost-user mode\n"
> +			"    UNIX domain socket is provided by -s option\n"
> +			"  --print-capabilities	print back-end capabilities in JSON format,\n"
> +			"    only meaningful for vhost-user mode\n");
>  	}
>  
>  	fprintf(f,
> @@ -1290,6 +1296,10 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"map-host-loopback", required_argument, NULL,		21 },
>  		{"map-guest-addr", required_argument,	NULL,		22 },
>  		{"dns-host",	required_argument,	NULL,		24 },
> +		{"vhost-user",	no_argument,		NULL,		25 },
> +		/* vhost-user backend program convention */
> +		{"print-capabilities", no_argument,	NULL,		26 },
> +		{"socket-path",	required_argument,	NULL,		's' },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1478,6 +1488,15 @@ void conf(struct ctx *c, int argc, char **argv)
>  				break;
>  
>  			die("Invalid host nameserver address: %s", optarg);
> +		case 25:
> +			if (c->mode == MODE_PASTA) {
> +				err("--vhost-user is for passt mode only");
> +				usage(argv[0], stdout, EXIT_SUCCESS);
> +			}
> +			c->mode = MODE_VU;
> +			break;
> +		case 26:
> +			vu_print_capabilities();
>  			break;
>  		case 'd':
>  			c->debug = 1;
> diff --git a/epoll_type.h b/epoll_type.h
> index 0ad1efa0ccec..f3ef41584757 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -36,6 +36,10 @@ enum epoll_type {
>  	EPOLL_TYPE_TAP_PASST,
>  	/* socket listening for qemu socket connections */
>  	EPOLL_TYPE_TAP_LISTEN,
> +	/* vhost-user command socket */
> +	EPOLL_TYPE_VHOST_CMD,
> +	/* vhost-user kick event socket */
> +	EPOLL_TYPE_VHOST_KICK,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/iov.c b/iov.c
> index 3f9e229a305f..3741db21790f 100644
> --- a/iov.c
> +++ b/iov.c
> @@ -68,7 +68,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
>   *
>   * Returns:    The number of bytes successfully copied.
>   */
> -/* cppcheck-suppress unusedFunction */
>  size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
>  		    size_t offset, const void *buf, size_t bytes)
>  {
> diff --git a/isolation.c b/isolation.c
> index 45fba1e68b9d..c2a3c7b7911d 100644
> --- a/isolation.c
> +++ b/isolation.c
> @@ -379,12 +379,19 @@ void isolate_postfork(const struct ctx *c)
>  
>  	prctl(PR_SET_DUMPABLE, 0);
>  
> -	if (c->mode == MODE_PASTA) {
> -		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> -		prog.filter = filter_pasta;
> -	} else {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
>  		prog.filter = filter_passt;
> +		break;
> +	case MODE_PASTA:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> +		prog.filter = filter_pasta;
> +		break;
> +	case MODE_VU:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
> +		prog.filter = filter_vu;
> +		break;

I'd feel more comfortable with a default case that calls die() or
ASSERT().  Obviously that shouldn't happen, but not applying any
filter doesn't seem like a good way to fail if somehow we get an
unexpected mode.

>  	}
>  
>  	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
> diff --git a/packet.c b/packet.c
> index 37489961a37e..e5a78d079231 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -36,6 +36,17 @@
>  static int packet_check_range(const struct pool *p, size_t offset, size_t len,
>  			      const char *start, const char *func, int line)
>  {
> +	if (p->buf_size == 0) {
> +		int ret;
> +
> +		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
> +
> +		if (ret == -1)
> +			trace("cannot find region, %s:%i", func, line);
> +
> +		return ret;
> +	}
> +
>  	if (start < p->buf) {
>  		trace("packet start %p before buffer start %p, "
>  		      "%s:%i", (void *)start, (void *)p->buf, func, line);
> diff --git a/packet.h b/packet.h
> index 8377dcf678bb..3f70e949c066 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -8,8 +8,10 @@
>  
>  /**
>   * struct pool - Generic pool of packets stored in a buffer
> - * @buf:	Buffer storing packet descriptors
> - * @buf_size:	Total size of buffer
> + * @buf:	Buffer storing packet descriptors,
> + * 		a struct vu_dev_region array for passt vhost-user mode
> + * @buf_size:	Total size of buffer,
> + * 		0 for passt vhost-user mode
>   * @size:	Number of usable descriptors for the pool
>   * @count:	Number of used descriptors for the pool
>   * @pkt:	Descriptors: see macros below
> @@ -22,6 +24,8 @@ struct pool {
>  	struct iovec pkt[1];
>  };
>  
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start);
>  void packet_add_do(struct pool *p, size_t len, const char *start,
>  		   const char *func, int line);
>  void *packet_get_do(const struct pool *p, const size_t idx,
> diff --git a/passt.1 b/passt.1
> index ef33267e9cd7..96532dd39aa2 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -397,12 +397,20 @@ interface address are configured on a given host interface.
>  .SS \fBpasst\fR-only options
>  
>  .TP
> -.BR \-s ", " \-\-socket " " \fIpath
> +.BR \-s ", " \-\-socket-path ", " \-\-socket " " \fIpath
>  Path for UNIX domain socket used by \fBqemu\fR(1) or \fBqrap\fR(1) to connect to
>  \fBpasst\fR.
>  Default is to probe a free socket, not accepting connections, starting from
>  \fI/tmp/passt_1.socket\fR to \fI/tmp/passt_64.socket\fR.
>  
> +.TP
> +.BR \-\-vhost-user
> +Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
> +
> +.TP
> +.BR \-\-print-capabilities
> +Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
> +
>  .TP
>  .BR \-F ", " \-\-fd " " \fIFD
>  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> diff --git a/passt.c b/passt.c
> index 79093ee02d62..2d105e81218d 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -52,6 +52,7 @@
>  #include "arch.h"
>  #include "log.h"
>  #include "tcp_splice.h"
> +#include "vu_common.h"
>  
>  #define EPOLL_EVENTS		8
>  
> @@ -74,6 +75,8 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_TAP_PASTA]		= "/dev/net/tun device",
>  	[EPOLL_TYPE_TAP_PASST]		= "connected qemu socket",
>  	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
> +	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
> +	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -360,6 +363,12 @@ loop:
>  		case EPOLL_TYPE_PING:
>  			icmp_sock_handler(&c, ref);
>  			break;
> +		case EPOLL_TYPE_VHOST_CMD:
> +			vu_control_handler(c.vdev, c.fd_tap, eventmask);
> +			break;
> +		case EPOLL_TYPE_VHOST_KICK:
> +			vu_kick_cb(c.vdev, ref, &now);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 4908ed937dc8..311482d36257 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -25,6 +25,8 @@ union epoll_ref;
>  #include "fwd.h"
>  #include "tcp.h"
>  #include "udp.h"
> +#include "udp_vu.h"

Why does udp_vu.h need to be included here?

> +#include "vhost_user.h"
>  
>  /* Default address for our end on the tap interface.  Bit 0 of byte 0 must be 0
>   * (unicast) and bit 1 of byte 1 must be 1 (locally administered).  Otherwise
> @@ -94,6 +96,7 @@ struct fqdn {
>  enum passt_modes {
>  	MODE_PASST,
>  	MODE_PASTA,
> +	MODE_VU,
>  };
>  
>  /**
> @@ -228,6 +231,7 @@ struct ip6_ctx {
>   * @freebind:		Allow binding of non-local addresses for forwarding
>   * @low_wmem:		Low probed net.core.wmem_max
>   * @low_rmem:		Low probed net.core.rmem_max
> + * @vdev:		vhost-user device
>   */
>  struct ctx {
>  	enum passt_modes mode;
> @@ -289,6 +293,8 @@ struct ctx {
>  
>  	int low_wmem;
>  	int low_rmem;
> +
> +	struct vu_dev *vdev;
>  };
>  
>  void proto_update_l2_buf(const unsigned char *eth_d,
> diff --git a/pcap.c b/pcap.c
> index 6ee6cdfd261a..718d6ad61732 100644
> --- a/pcap.c
> +++ b/pcap.c
> @@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
>   * @iovcnt:	Number of buffers (@iov entries)
>   * @offset:	Offset of the L2 frame within the full data length
>   */
> -/* cppcheck-suppress unusedFunction */
>  void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
>  {
>  	struct timespec now;
> diff --git a/tap.c b/tap.c
> index 4b826fdf7adc..22d19f1833f7 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -58,6 +58,8 @@
>  #include "packet.h"
>  #include "tap.h"
>  #include "log.h"
> +#include "vhost_user.h"
> +#include "vu_common.h"
>  
>  /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
>  static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
> @@ -78,16 +80,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
>  	struct iovec iov[2];
>  	size_t iovcnt = 0;
>  
> -	if (c->mode == MODE_PASST) {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
>  		iovcnt++;
> -	}
> -
> -	iov[iovcnt].iov_base = (void *)data;
> -	iov[iovcnt].iov_len = l2len;
> -	iovcnt++;
> +		/* fall through */
> +	case MODE_PASTA:
> +		iov[iovcnt].iov_base = (void *)data;
> +		iov[iovcnt].iov_len = l2len;
> +		iovcnt++;
>  
> -	tap_send_frames(c, iov, iovcnt, 1);
> +		tap_send_frames(c, iov, iovcnt, 1);
> +		break;
> +	case MODE_VU:
> +		vu_send_single(c, data, l2len);
> +		break;
> +	}
>  }
>  
>  /**
> @@ -414,10 +422,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  	if (!nframes)
>  		return 0;
>  
> -	if (c->mode == MODE_PASTA)
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
> -	else
> +		break;
> +	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
> +		break;
> +	case MODE_VU:
> +		/* fall through */
> +	default:
> +		ASSERT(0);
> +	}
>  
>  	if (m < nframes)
>  		debug("tap: failed to send %zu frames of %zu",
> @@ -976,7 +992,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
>   * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>   * @c:		Execution context
>   */
> -static void tap_sock_reset(struct ctx *c)
> +void tap_sock_reset(struct ctx *c)
>  {
>  	info("Client connection closed%s", c->one_off ? ", exiting" : "");
>  
> @@ -987,6 +1003,8 @@ static void tap_sock_reset(struct ctx *c)
>  	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
>  	close(c->fd_tap);
>  	c->fd_tap = -1;
> +	if (c->mode == MODE_VU)
> +		vu_cleanup(c->vdev);
>  }
>  
>  /**
> @@ -1205,6 +1223,11 @@ static void tap_backend_show_hints(struct ctx *c)
>  		info("or qrap, for earlier qemu versions:");
>  		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
>  		break;
> +	case MODE_VU:
> +		info("You can start qemu with:");
> +		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
> +		     c->sock_path);
> +		break;
>  	}
>  }
>  
> @@ -1232,8 +1255,8 @@ static void tap_sock_unix_init(const struct ctx *c)
>   */
>  void tap_listen_handler(struct ctx *c, uint32_t events)
>  {
> -	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
>  	struct epoll_event ev = { 0 };
> +	union epoll_ref ref = { 0 };
>  	int v = INT_MAX / 2;
>  	struct ucred ucred;
>  	socklen_t len;
> @@ -1273,6 +1296,10 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  		trace("tap: failed to set SO_SNDBUF to %i", v);
>  
>  	ref.fd = c->fd_tap;
> +	if (c->mode == MODE_VU)
> +		ref.type = EPOLL_TYPE_VHOST_CMD;
> +	else
> +		ref.type = EPOLL_TYPE_TAP_PASST;
>  	ev.events = EPOLLIN | EPOLLRDHUP;
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> @@ -1339,7 +1366,7 @@ static void tap_sock_tun_init(struct ctx *c)
>   * @base:	Buffer base
>   * @size	Buffer size
>   */
> -static void tap_sock_update_pool(void *base, size_t size)
> +void tap_sock_update_pool(void *base, size_t size)
>  {
>  	int i;
>  
> @@ -1353,13 +1380,15 @@ static void tap_sock_update_pool(void *base, size_t size)
>  }
>  
>  /**
> - * tap_backend_init() - Create and set up AF_UNIX socket or
> - *			tuntap file descriptor
> + * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor

I'm guessing you didn't mean to revert the name change in the comment
here.

>   * @c:		Execution context
>   */
>  void tap_backend_init(struct ctx *c)
>  {
> -	tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
> +	if (c->mode == MODE_VU)
> +		tap_sock_update_pool(NULL, 0);
> +	else
> +		tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
>  
>  	if (c->fd_tap != -1) { /* Passed as --fd */
>  		struct epoll_event ev = { 0 };
> @@ -1367,10 +1396,17 @@ void tap_backend_init(struct ctx *c)
>  
>  		ASSERT(c->one_off);
>  		ref.fd = c->fd_tap;
> -		if (c->mode == MODE_PASST)
> +		switch (c->mode) {
> +		case MODE_PASST:
>  			ref.type = EPOLL_TYPE_TAP_PASST;
> -		else
> +			break;
> +		case MODE_PASTA:
>  			ref.type = EPOLL_TYPE_TAP_PASTA;
> +			break;
> +		case MODE_VU:
> +			ref.type = EPOLL_TYPE_VHOST_CMD;
> +			break;
> +		}
>  
>  		ev.events = EPOLLIN | EPOLLRDHUP;
>  		ev.data.u64 = ref.u64;
> @@ -1378,9 +1414,14 @@ void tap_backend_init(struct ctx *c)
>  		return;
>  	}
>  
> -	if (c->mode == MODE_PASTA) {
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		tap_sock_tun_init(c);
> -	} else {
> +		break;
> +	case MODE_VU:
> +		vu_init(c);
> +		/* fall through */
> +	case MODE_PASST:
>  		tap_sock_unix_init(c);
>  
>  		/* In passt mode, we don't know the guest's MAC address until it
> @@ -1388,6 +1429,7 @@ void tap_backend_init(struct ctx *c)
>  		 * first packets will reach it.
>  		 */
>  		memset(&c->guest_mac, 0xff, sizeof(c->guest_mac));
> +		break;
>  	}
>  
>  	tap_backend_show_hints(c);
> diff --git a/tap.h b/tap.h
> index 8728cc5c09c3..dfbd8b9ebd72 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
>   */
>  static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
>  {
> -	thdr->vnet_len = htonl(l2len);
> +	if (thdr)
> +		thdr->vnet_len = htonl(l2len);

So.. I do think you could treat the virtio_net_hdr_mrg_rxbuf structure
as the "tap" header for vhost-user, though it will require some
refactoring.  I think that could simplify a bunch of iov juggling
further on.

Obviously, this function would need more parameters: at least ctx for
the mode and the number of (payload) buffers.  The thdr parameter
would need to become a void * or similar too.

Note that this function is already conceptually dependent on the mode:
for passt it needs to fill in the vnet_len, for pasta it doesn't need
to do anything.  For now it fills in the vnet_len in both cases just
for simplicity because it's harmless though pointless in the pasta
case.

More comments further down, where I think that approach will simplify
things a bit.

>  }
>  
>  void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
> @@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  int tap_sock_unix_open(char *sock_path);
> +void tap_sock_reset(struct ctx *c);
> +void tap_sock_update_pool(void *base, size_t size);
>  void tap_backend_init(struct ctx *c);
>  void tap_flush_pools(void);
>  void tap_handler(struct ctx *c, const struct timespec *now);
> diff --git a/tcp.c b/tcp.c
> index eae02b1647e3..fd2def0d8a39 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -304,6 +304,7 @@
>  #include "flow_table.h"
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
> +#include "tcp_vu.h"
>  
>  /* MSS rounding: see SET_MSS() */
>  #define MSS_DEFAULT			536
> @@ -1328,6 +1329,9 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
>  static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
>  			 int flags)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_send_flag(c, conn, flags);
> +
>  	return tcp_buf_send_flag(c, conn, flags);
>  }
>  
> @@ -1721,6 +1725,9 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>   */
>  static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_data_from_sock(c, conn);
> +
>  	return tcp_buf_data_from_sock(c, conn);
>  }
>  
> diff --git a/tcp_vu.c b/tcp_vu.c
> new file mode 100644
> index 000000000000..1126fb39d138
> --- /dev/null
> +++ b/tcp_vu.c
> @@ -0,0 +1,476 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* tcp_vu.c - TCP L2 vhost-user management functions
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#include <errno.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +#include <netinet/ip.h>
> +
> +#include <sys/socket.h>
> +
> +#include <linux/tcp.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "vhost_user.h"
> +#include "tcp.h"
> +#include "pcap.h"
> +#include "flow.h"
> +#include "tcp_conn.h"
> +#include "flow_table.h"
> +#include "tcp_vu.h"
> +#include "tap.h"
> +#include "tcp_internal.h"
> +#include "checksum.h"
> +#include "vu_common.h"
> +
> +static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1];
> +static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +
> +/**
> + * tcp_vu_l2_hdrlen() - return the size of the header in level 2 frame (TDP)

I don't love this name. To be meaningful "header length" needs to say
both the first and last header that's relevant here, so "l2" doesn't
really disambiguate anything.  I'd suggest just "tcp_vu_hdrlen()" then
in the longer comment say it's the total length of the L2, L3 & L4
headers.

Although... looking deeper it seems like most (all?) times you use the
output from this you add the size of virtio_net_hdr_mrg_rxbuf to it
anyway.  So would it be simpler to just include that here in the first
place.

Also s/TDP/TCP/?

> + * @v6:		Set for IPv6 packet
> + *
> + * Return: Return the size of the header
> + */
> +static size_t tcp_vu_l2_hdrlen(bool v6)
> +{
> +	size_t l2_hdrlen;
> +
> +	l2_hdrlen = sizeof(struct ethhdr) + sizeof(struct tcphdr);
> +
> +	if (v6)
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	else
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +	return l2_hdrlen;
> +}
> +
> +/**
> + * tcp_vu_update_check() - Calculate TCP checksum
> + * @tapside:	Address information for one side of the flow
> + * @iov:	Pointer to the array of IO vectors
> + * @iov_used:	Length of the array
> + */
> +static void tcp_vu_update_check(const struct flowside *tapside,
> +			        struct iovec *iov, int iov_used)

AFAICT this is only used for the pcap path.  Rather than filling in
the checksum at a different point from normal, I think it would be
easier to just clear the no_tcp_csum flag when pcap is enabled.  That
would, AFAICT, remove the need for this function entirely.

> +{
> +	char *base = iov[0].iov_base;
> +
> +	if (inany_v4(&tapside->oaddr)) {
> +		const struct iphdr *iph = vu_ip(base);
> +
> +		tcp_update_check_tcp4(iph, iov, iov_used,
> +				      (char *)vu_payloadv4(base) - base);
> +	} else {
> +		const struct ipv6hdr *ip6h = vu_ip(base);
> +
> +		tcp_update_check_tcp6(ip6h, iov, iov_used,
> +				      (char *)vu_payloadv6(base) - base);
> +	}
> +}
> +
> +/**
> + * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @flags:	TCP flags: if not set, send segment only if ACK is due
> + *
> + * Return: negative error code on connection reset, 0 otherwise
> + */
> +int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	size_t l2len, l4len, optlen, hdrlen;
> +	struct ethhdr *eh;
> +	int elem_cnt;
> +	int nb_ack;
> +	int ret;
> +
> +	hdrlen = tcp_vu_l2_hdrlen(CONN_V6(conn));
> +
> +	vu_init_elem(elem, iov_vu, 2);

I'm finding the use of iov_vu here confusing.  AFAICT when
transferring data from sock to vu, it represents the "sock side" IOV -
so it points to just the pieces of VU memory that are used for the
actual data payloads.  Here, where there's no socket side data, I'm
not sure what it's supposed to mean.

I think it might be better to just use a local 2-element iovec.

> +	elem_cnt = vu_collect_one_frame(vdev, vq, elem, 1,
> +					hdrlen + OPT_MSS_LEN + OPT_WS_LEN + 1,
> +					0, NULL);

AFAICT the size passed to vu_collect_one_frame() doesn't include the
headers, but here you do include it.

Also, IIUC, this carefully removes space for the headers from the
IOV...

> +	if (elem_cnt < 1)
> +		return 0;
> +
> +	vu_set_vnethdr(vdev, &iov_vu[0], 1, 0);

... then this puts it right back again.  That also seems confusing.

> +
> +	eh = vu_eth(iov_vu[0].iov_base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	if (CONN_V4(conn)) {
> +		struct tcp_payload_t *payload;
> +		struct iphdr *iph;
> +		uint32_t seq;
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		iph = vu_ip(iov_vu[0].iov_base);
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv4(iov_vu[0].iov_base);

I'm not sure using tcp_payload_t is a good idea here, since it kind of
implies a buffer for a max-size frame, which is not necessary and
might not be the case here.  tcp_flags_t might make more sense.

> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_flags_t, opts) / 4;
> +		payload->th.ack = 1;
> +
> +		seq = conn->seq_to_tap;
> +		ret = tcp_prepare_flags(c, conn, flags, &payload->th,
> +					(char *)payload->data, &optlen);

vu_collect_one_frame() verifies that the fixed size portion of the
headers are contiguous and in the first buffer.  However, here, you're
also requiring that there's contiguous space in that first buffer for
the TCP header options as well.  That, too, needs to be verified.

> +		if (ret <= 0) {
> +			vu_queue_rewind(vq, 1);
> +			return ret;
> +		}
> +
> +		l4len = tcp_fill_headers4(conn, NULL, iph, payload, optlen,
> +					  NULL, seq, true);
> +		l2len = sizeof(*iph);
> +	} else {
> +		struct tcp_payload_t *payload;
> +		struct ipv6hdr *ip6h;
> +		uint32_t seq;
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		ip6h = vu_ip(iov_vu[0].iov_base);
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv6(iov_vu[0].iov_base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_flags_t, opts) / 4;
> +		payload->th.ack = 1;
> +
> +		seq = conn->seq_to_tap;
> +		ret = tcp_prepare_flags(c, conn, flags, &payload->th,
> +					(char *)payload->data, &optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vq, 1);
> +			return ret;
> +		}
> +
> +		l4len = tcp_fill_headers6(conn, NULL, ip6h, payload, optlen,
> +					  seq, true);
> +		l2len = sizeof(*ip6h);
> +	}
> +	l2len += l4len + sizeof(struct ethhdr);
> +
> +	elem[0].in_sg[0].iov_len = l2len +
> +				   sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	if (*c->pcap) {
> +		tcp_vu_update_check(tapside, &elem[0].in_sg[0], 1);
> +		pcap_iov(&elem[0].in_sg[0], 1,
> +			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +	}
> +	nb_ack = 1;
> +
> +	if (flags & DUP_ACK) {
> +		elem_cnt = vu_collect_one_frame(vdev, vq, &elem[1], 1, l2len,
> +						0, NULL);
> +		if (elem_cnt == 1) {
> +			memcpy(elem[1].in_sg[0].iov_base,
> +			       elem[0].in_sg[0].iov_base, l2len);

IIUC, the iovs in in_sg will include the virtio_net_hdr_mrg_rxbuf, but
l2len does not include it, so I think this might fail to copy the tail
of the packet.

> +			vu_set_vnethdr(vdev, &elem[1].in_sg[0], 1, 0);
> +			nb_ack++;
> +
> +			if (*c->pcap)
> +				pcap_iov(&elem[1].in_sg[0], 1, 0);
> +		}
> +	}
> +
> +	vu_flush(vdev, vq, elem, nb_ack);

Is there a VU specific reason to flush immediately, rather than
processing more stuff and flushing everything in the deferred handler?

> +	return 0;
> +}
> +
> +/** tcp_vu_sock_recv() - Receive datastream from socket into vhost-user buffers
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @v4:		Set for IPv4 connections

Most places we have a flag for v6 rather than a flag for v4, so maybe
invert this for consistency?

> + * @fillsize:	Number of bytes we can receive
> + * @datalen:	Size of received data (output)

s/datalen/dlen. ?

> + *
> + * Return: Number of iov entries used to store the data
> + */
> +static ssize_t tcp_vu_sock_recv(const struct ctx *c,
> +				struct tcp_tap_conn *conn, bool v4,
> +				size_t fillsize, ssize_t *dlen)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct msghdr mh_sock = { 0 };
> +	uint16_t mss = MSS_GET(conn);
> +	int s = conn->sock;
> +	size_t l2_hdrlen;
> +	int elem_cnt;
> +	ssize_t ret;
> +
> +	*dlen = 0;
> +
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +
> +	vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
> +
> +	elem_cnt = vu_collect(vdev, vq, elem, VIRTQUEUE_MAX_SIZE, mss,
> +			      l2_hdrlen, fillsize);
> +	if (elem_cnt < 0) {
> +		tcp_rst(c, conn);
> +		return -ENOMEM;

Does this warrant a reset?  Couldn't this happen due to a temporary
shortage of receive buffers on the guest side, in which case we'd want
to just resend later.

> +	}
> +
> +	mh_sock.msg_iov = iov_vu;
> +	mh_sock.msg_iovlen = elem_cnt + 1;
> +
> +	do
> +		ret = recvmsg(s, &mh_sock, MSG_PEEK);
> +	while (ret < 0 && errno == EINTR);
> +
> +	if (ret < 0) {
> +		vu_queue_rewind(vq, elem_cnt);
> +		if (errno != EAGAIN && errno != EWOULDBLOCK) {

I'm assuming this is supposed to be the errno from recvmsg()?  It's
generally not a good idea to check errno after calling any other
functions, because it's so easy to clobber accidentally.

> +			ret = -errno;
> +			tcp_rst(c, conn);
> +		}
> +		return ret;
> +	}
> +	if (!ret) {
> +		vu_queue_rewind(vq, elem_cnt);
> +
> +		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> +			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
> +			if (retf) {
> +				tcp_rst(c, conn);
> +				return retf;
> +			}
> +
> +			conn_event(c, conn, TAP_FIN_SENT);
> +		}
> +		return 0;
> +	}
> +
> +	*dlen = ret;
> +
> +	return elem_cnt;
> +}
> +
> +/**
> + * tcp_vu_prepare() - Prepare the packet header
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @first:	Pointer to the array of IO vectors
> + * @dlen:	Packet data length
> + * @check:	Checksum, if already known
> + */
> +static void tcp_vu_prepare(const struct ctx *c,
> +			   struct tcp_tap_conn *conn, struct iovec *first,
> +			   size_t dlen, const uint16_t **check)
> +{
> +	const struct flowside *toside = TAPFLOW(conn);
> +	char *base = first->iov_base;
> +	struct ethhdr *eh;
> +
> +	/* we guess the first iovec provided by the guest can embed
> +	 * all the headers needed by L2 frame
> +	 */
> +
> +	eh = vu_eth(base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	/* initialize header */
> +	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
> +		struct tcp_payload_t *payload;
> +		struct iphdr *iph;
> +
> +		ASSERT(first[0].iov_len >= sizeof(struct virtio_net_hdr_mrg_rxbuf) +
> +		       sizeof(struct ethhdr) + sizeof(struct iphdr) +
> +		       sizeof(struct tcphdr));
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		iph = vu_ip(base);
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +		payload = vu_payloadv4(base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_payload_t, data) / 4;
> +		payload->th.ack = 1;
> +
> +		tcp_fill_headers4(conn, NULL, iph, payload, dlen,
> +				  *check, conn->seq_to_tap, true);

I wonder if there's another opportunity for more logic sharing between
the buf and vu paths here: tcp_fill_headers[46]() fills in the dynamic
(vary per packet) parts of the header.  Could we add a function that
fills in the static (ish) parts.  VU would call it on every packet,
buf would call it during buffer setup.

> +		*check = &iph->check;
> +	} else {
> +		struct tcp_payload_t *payload;
> +		struct ipv6hdr *ip6h;
> +
> +		ASSERT(first[0].iov_len >= sizeof(struct virtio_net_hdr_mrg_rxbuf) +
> +		       sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
> +		       sizeof(struct tcphdr));
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		ip6h = vu_ip(base);
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv6(base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_payload_t, data) / 4;
> +		payload->th.ack = 1;
> +
> +		tcp_fill_headers6(conn, NULL, ip6h, payload, dlen,
> +				  conn->seq_to_tap, true);
> +	}
> +}
> +
> +/**
> + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
> + *			     in window
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + *
> + * Return: Negative on connection reset, 0 otherwise
> + */
> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	uint16_t mss = MSS_GET(conn);
> +	size_t l2_hdrlen, fillsize;
> +	int i, iov_cnt, iov_used;
> +	int v4 = CONN_V4(conn);
> +	uint32_t already_sent = 0;
> +	const uint16_t *check;
> +	struct iovec *first;
> +	int frame_size;
> +	int num_buffers;
> +	ssize_t len;
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {

Under what circumstances could we get here?  I'm wondering if we risk either:
  1. Spinning on a socket EPOLLIN event until the virtqueue is ready
or
  2. Essentially losing the event here due to EPOLLET, and never
     coming back to read the data queued on the socket

> +		flow_err(conn,
> +			 "Got packet, but RX virtqueue not usable yet");
> +		return 0;
> +	}
> +
> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> +
> +	if (SEQ_LT(already_sent, 0)) {
> +		/* RFC 761, section 2.1. */
> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> +		already_sent = 0;
> +	}
> +
> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, STALLED);
> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +		return 0;
> +	}
> +
> +	/* Set up buffer descriptors we'll fill completely and partially. */

This seems tantalizingly close to allowing more code to be shared
between the vu and other paths:
	1. Common already_sent calculations
	2. Backend specific code constructs an mh array
	3. Common code does the recvmsg() and follow up event / flag handling
	4. Backend specific code forwards to tap

The need for vu_queue_rewind()s complicates things a bit, but I wonder
if we can work around that.

> +
> +	fillsize = wnd_scaled;
> +
> +	if (peek_offset_cap)
> +		already_sent = 0;
> +
> +	iov_vu[0].iov_base = tcp_buf_discard;
> +	iov_vu[0].iov_len = already_sent;
> +	fillsize -= already_sent;
> +
> +	/* collect the buffers from vhost-user and fill them with the
> +	 * data from the socket
> +	 */
> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
> +	if (iov_cnt <= 0)
> +		return iov_cnt;
> +
> +	len -= already_sent;
> +	if (len <= 0) {
> +		conn_flag(c, conn, STALLED);
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}
> +
> +	conn_flag(c, conn, ~STALLED);
> +
> +	/* Likely, some new data was acked too. */
> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> +
> +	/* initialize headers */
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +	iov_used = 0;
> +	num_buffers = 0;
> +	check = NULL;
> +	frame_size = 0;
> +
> +	/* iov_vu is an array of buffers and the buffer size can be
> +	 * smaller than the frame size we want to use but with
> +	 * num_buffer we can merge several virtio iov buffers in one packet
> +	 * we need only to set the packet headers in the first iov and
> +	 * num_buffer to the number of iov entries
> +	 */
> +	for (i = 0; i < iov_cnt && len; i++) {

AFAICT, you always use (i + 1) never bare i in the body, so it would
probably be simpler to change the loop to be from i=1 to i <= iov_cnt.

> +
> +		if (frame_size == 0)
> +			first = &iov_vu[i + 1];
> +
> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> +			iov_vu[i + 1].iov_len = len;
> +
> +		len -= iov_vu[i + 1].iov_len;
> +		iov_used++;
> +
> +		frame_size += iov_vu[i + 1].iov_len;
> +		num_buffers++;
> +
> +		if (frame_size >= mss || len == 0 ||
> +		    i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +			if (i + 1 == iov_cnt)
> +				check = NULL;
> +
> +			/* restore first iovec base: point to vnet header */
> +			vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen);
> +
> +			tcp_vu_prepare(c, conn, first, frame_size, &check);
> +			if (*c->pcap)  {
> +				tcp_vu_update_check(tapside, first, num_buffers);
> +				pcap_iov(first, num_buffers,
> +					 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +			}
> +
> +			conn->seq_to_tap += frame_size;
> +
> +			frame_size = 0;
> +			num_buffers = 0;
> +		}
> +	}
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	/* send packets */
> +	vu_flush(vdev, vq, elem, iov_used);
> +
> +	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +
> +	return 0;
> +}
> diff --git a/tcp_vu.h b/tcp_vu.h
> new file mode 100644
> index 000000000000..6ab6057f352a
> --- /dev/null
> +++ b/tcp_vu.h
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef TCP_VU_H
> +#define TCP_VU_H
> +
> +int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
> +
> +#endif  /*TCP_VU_H */
> diff --git a/udp.c b/udp.c
> index 8fc5d8099310..1171d9d1a75b 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -628,6 +628,11 @@ void udp_listen_sock_handler(const struct ctx *c,
>  			     union epoll_ref ref, uint32_t events,
>  			     const struct timespec *now)
>  {
> +	if (c->mode == MODE_VU) {
> +		udp_vu_listen_sock_handler(c, ref, events, now);
> +		return;
> +	}
> +
>  	udp_buf_listen_sock_handler(c, ref, events, now);
>  }
>  
> @@ -697,6 +702,11 @@ static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>  void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>  			    uint32_t events, const struct timespec *now)
>  {
> +	if (c->mode == MODE_VU) {
> +		udp_vu_reply_sock_handler(c, ref, events, now);
> +		return;
> +	}
> +
>  	udp_buf_reply_sock_handler(c, ref, events, now);
>  }
>  
> diff --git a/udp_vu.c b/udp_vu.c
> new file mode 100644
> index 000000000000..3cb76945c9c1
> --- /dev/null
> +++ b/udp_vu.c
> @@ -0,0 +1,336 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* udp_vu.c - UDP L2 vhost-user management functions
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#include <unistd.h>
> +#include <assert.h>
> +#include <net/ethernet.h>
> +#include <net/if.h>
> +#include <netinet/in.h>
> +#include <netinet/ip.h>
> +#include <netinet/udp.h>
> +#include <stdint.h>
> +#include <stddef.h>
> +#include <sys/uio.h>
> +#include <linux/virtio_net.h>
> +
> +#include "checksum.h"
> +#include "util.h"
> +#include "ip.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "passt.h"
> +#include "pcap.h"
> +#include "log.h"
> +#include "vhost_user.h"
> +#include "udp_internal.h"
> +#include "flow.h"
> +#include "flow_table.h"
> +#include "udp_flow.h"
> +#include "udp_vu.h"
> +#include "vu_common.h"
> +
> +static struct iovec     iov_vu		[VIRTQUEUE_MAX_SIZE];
> +static struct vu_virtq_element	elem		[VIRTQUEUE_MAX_SIZE];
> +
> +/**
> + * udp_vu_l2_hdrlen() - return the size of the header in level 2 frame (UDP)
> + * @v6:		Set for IPv6 packet
> + *
> + * Return: Return the size of the header
> + */
> +static size_t udp_vu_l2_hdrlen(bool v6)
> +{
> +	size_t l2_hdrlen;
> +
> +	l2_hdrlen = sizeof(struct ethhdr) + sizeof(struct udphdr);
> +
> +	if (v6)
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	else
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +	return l2_hdrlen;
> +}
> +
> +static int udp_vu_sock_init(int s, union sockaddr_inany *s_in)
> +{
> +	struct msghdr msg = {
> +		.msg_name = s_in,
> +		.msg_namelen = sizeof(union sockaddr_inany),
> +	};
> +
> +	return recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
> +}
> +
> +/**
> + * udp_vu_sock_recv() - Receive datagrams from socket into vhost-user buffers
> + * @c:		Execution context
> + * @s:		Socket to receive from
> + * @events:	epoll events bitmap
> + * @v6:		Set for IPv6 connections
> + * @datalen:	Size of received data (output)
> + *
> + * Return: Number of iov entries used to store the datagram
> + */
> +static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events,
> +			    bool v6, ssize_t *dlen)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int max_elem, iov_cnt, idx, iov_used;
> +	struct msghdr msg  = { 0 };
> +	size_t off, l2_hdrlen;
> +
> +	ASSERT(!c->no_udp);
> +
> +	if (!(events & EPOLLIN))
> +		return 0;
> +
> +	/* compute L2 header length */
> +
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		max_elem = VIRTQUEUE_MAX_SIZE;
> +	else
> +		max_elem = 1;
> +
> +	l2_hdrlen = udp_vu_l2_hdrlen(v6);
> +
> +	vu_init_elem(elem, iov_vu, max_elem);
> +
> +	iov_cnt = vu_collect_one_frame(vdev, vq, elem, max_elem,
> +			      ETH_MAX_MTU - l2_hdrlen,
> +			      l2_hdrlen, NULL);
> +	if (iov_cnt == 0)
> +		return 0;
> +
> +	msg.msg_iov = iov_vu;
> +	msg.msg_iovlen = iov_cnt;
> +
> +	*dlen = recvmsg(s, &msg, 0);
> +	if (*dlen < 0) {
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}

It'd be nice if we could use recvmmsg() here.  I think it's not super
hard, with a preliminary step a bit like the vu_collect() in the TCP
code.  That would gather a bunch of buffers from the queue into an
IOV, then construct an array of msghdrs each with a slice of that IO
vector large enough to hold a max size frame.

It would be reasonable to postpone that after the initial version
though.

> +	/* count the numbers of buffer filled by recvmsg() */
> +	idx = iov_skip_bytes(iov_vu, iov_cnt, *dlen, &off);
> +
> +	/* adjust last iov length */
> +	if (idx < iov_cnt)
> +		iov_vu[idx].iov_len = off;
> +	iov_used = idx + !!off;
> +
> +	/* we have at least the header */
> +	if (iov_used == 0)
> +		iov_used = 1;
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	vu_set_vnethdr(vdev, &iov_vu[0], iov_used, l2_hdrlen);
> +
> +	return iov_used;
> +}
> +
> +/**
> + * udp_vu_prepare() - Prepare the packet header
> + * @c:		Execution context
> + * @toside:	Address information for one side of the flow
> + * @datalen:	Packet data length
> + *
> + * Return: Layer-4 length
> + */
> +static size_t udp_vu_prepare(const struct ctx *c,
> +			     const struct flowside *toside, ssize_t dlen)
> +{
> +	struct ethhdr *eh;
> +	size_t l4len;
> +
> +	/* ethernet header */
> +	eh = vu_eth(iov_vu[0].iov_base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	/* initialize header */
> +	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
> +		struct iphdr *iph = vu_ip(iov_vu[0].iov_base);
> +		struct udp_payload_t *bp = vu_payloadv4(iov_vu[0].iov_base);
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
> +
> +		l4len = udp_update_hdr4(iph, bp, toside, dlen, true);
> +	} else {
> +		struct ipv6hdr *ip6h = vu_ip(iov_vu[0].iov_base);
> +		struct udp_payload_t *bp = vu_payloadv6(iov_vu[0].iov_base);
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
> +
> +		l4len = udp_update_hdr6(ip6h, bp, toside, dlen, true);
> +	}
> +
> +	return l4len;
> +}
> +
> +/**
> + * udp_vu_csum() - Calculate and set checksum for a UDP packet
> + * @toside:	ddress information for one side of the flow
> + * @l4len:	IPv4 Payload length
> + * @iov_used:	Length of the array
> + */
> +static void udp_vu_csum(const struct flowside *toside, int iov_used)
> +{
> +	const struct in_addr *src4 = inany_v4(&toside->oaddr);
> +	const struct in_addr *dst4 = inany_v4(&toside->eaddr);
> +	char *base = iov_vu[0].iov_base;
> +	struct udp_payload_t *bp;
> +
> +	if (src4 && dst4) {
> +		bp = vu_payloadv4(base);
> +		csum_udp4(&bp->uh, *src4, *dst4, iov_vu, iov_used,
> +			  (char *)&bp->data - base);
> +	} else {
> +		bp = vu_payloadv6(base);
> +		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6,
> +			  iov_vu, iov_used, (char *)&bp->data - base);
> +	}
> +}
> +
> +/**
> + * udp_vu_listen_sock_handler() - Handle new data from socket
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @events:	epoll events bitmap
> + * @now:	Current timestamp
> + */
> +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int i;
> +
> +	if (udp_sock_errs(c, ref.fd, events) < 0) {
> +		err("UDP: Unrecoverable error on listening socket:"
> +		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
> +		return;
> +	}
> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		const struct flowside *toside;
> +		union sockaddr_inany s_in;
> +		flow_sidx_t batchsidx;
> +		uint8_t batchpif;

These names are a bit misleading in their new context.  Since you're
processing one datagram at a time, there are no "batches".

> +		ssize_t dlen;
> +		int iov_used;
> +		bool v6;
> +
> +		if (udp_vu_sock_init(ref.fd, &s_in) < 0)
> +			break;
> +
> +		batchsidx = udp_flow_from_sock(c, ref, &s_in, now);
> +		batchpif = pif_at_sidx(batchsidx);
> +
> +		if (batchpif != PIF_TAP) {
> +			if (flow_sidx_valid(batchsidx)) {
> +				flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
> +				struct udp_flow *uflow = udp_at_sidx(batchsidx);
> +
> +				flow_err(uflow,
> +					"No support for forwarding UDP from %s to %s",
> +					pif_name(pif_at_sidx(fromsidx)),
> +					pif_name(batchpif));
> +			} else {
> +				debug("Discarding 1 datagram without flow");
> +			}
> +
> +			continue;
> +		}
> +
> +		toside = flowside_at_sidx(batchsidx);
> +
> +		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> +
> +		iov_used = udp_vu_sock_recv(c, ref.fd, events, v6, &dlen);
> +		if (iov_used <= 0)
> +			break;
> +
> +		udp_vu_prepare(c, toside, dlen);
> +		if (*c->pcap) {
> +			udp_vu_csum(toside, iov_used);
> +			pcap_iov(iov_vu, iov_used,
> +				 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +		}
> +		vu_flush(vdev, vq, elem, iov_used);

You're flushing on every datagram.  Is that intentional, rather than
leaving it until you've processed the whole loop's worth?

> +	}
> +}
> +
> +/**
> + * udp_vu_reply_sock_handler() - Handle new data from flow specific socket
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @events:	epoll events bitmap
> + * @now:	Current timestamp
> + */
> +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			        uint32_t events, const struct timespec *now)
> +{
> +	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
> +	const struct flowside *toside = flowside_at_sidx(tosidx);
> +	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
> +	int from_s = uflow->s[ref.flowside.sidei];
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int i;
> +
> +	ASSERT(!c->no_udp);
> +
> +	if (udp_sock_errs(c, from_s, events) < 0) {
> +		flow_err(uflow, "Unrecoverable error on reply socket");
> +		flow_err_details(uflow);
> +		udp_flow_close(c, uflow);
> +		return;
> +	}
> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		uint8_t topif = pif_at_sidx(tosidx);
> +		ssize_t dlen;
> +		int iov_used;
> +		bool v6;
> +
> +		ASSERT(uflow);
> +
> +		if (topif != PIF_TAP) {
> +			uint8_t frompif = pif_at_sidx(ref.flowside);
> +
> +			flow_err(uflow,
> +				 "No support for forwarding UDP from %s to %s",
> +				 pif_name(frompif), pif_name(topif));
> +			continue;
> +		}
> +
> +		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> +
> +		iov_used = udp_vu_sock_recv(c, from_s, events, v6, &dlen);
> +		if (iov_used <= 0)
> +			break;
> +		flow_trace(uflow, "Received 1 datagram on reply socket");
> +		uflow->ts = now->tv_sec;
> +
> +		udp_vu_prepare(c, toside, dlen);
> +		if (*c->pcap) {
> +			udp_vu_csum(toside, iov_used);
> +			pcap_iov(iov_vu, iov_used,
> +				 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +		}
> +		vu_flush(vdev, vq, elem, iov_used);
> +	}
> +}
> diff --git a/udp_vu.h b/udp_vu.h
> new file mode 100644
> index 000000000000..ba7018d3bf01
> --- /dev/null
> +++ b/udp_vu.h
> @@ -0,0 +1,13 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef UDP_VU_H
> +#define UDP_VU_H
> +
> +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now);
> +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			       uint32_t events, const struct timespec *now);
> +#endif /* UDP_VU_H */
> diff --git a/vhost_user.c b/vhost_user.c
> index 1e302926b8fe..e905f3329f71 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -48,12 +48,13 @@
>  /* vhost-user version we are compatible with */
>  #define VHOST_USER_VERSION 1
>  
> +static struct vu_dev vdev_storage;
> +
>  /**
>   * vu_print_capabilities() - print vhost-user capabilities
>   * 			     this is part of the vhost-user backend
>   * 			     convention.
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_print_capabilities(void)
>  {
>  	info("{");
> @@ -163,9 +164,7 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
>   */
>  static void vu_remove_watch(const struct vu_dev *vdev, int fd)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> -	(void)fd;
> +	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
>  }
>  
>  /**
> @@ -487,6 +486,14 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
>  		}
>  	}
>  
> +	/* As vu_packet_check_range() has no access to the number of
> +	 * memory regions, mark the end of the array with mmap_addr = 0
> +	 */
> +	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
> +	vdev->regions[vdev->nregions].mmap_addr = 0;
> +
> +	tap_sock_update_pool(vdev->regions, 0);
> +
>  	return false;
>  }
>  
> @@ -615,9 +622,12 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
>   */
>  static void vu_set_watch(const struct vu_dev *vdev, int fd)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> -	(void)fd;
> +	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
> +	struct epoll_event ev = { 0 };
> +
> +	ev.data.u64 = ref.u64;
> +	ev.events = EPOLLIN;
> +	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, fd, &ev);
>  }
>  
>  /**
> @@ -829,14 +839,14 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
>   * @c:		execution context
>   * @vdev:	vhost-user device
>   */
> -/* cppcheck-suppress unusedFunction */
> -void vu_init(struct ctx *c, struct vu_dev *vdev)
> +void vu_init(struct ctx *c)
>  {
>  	int i;
>  
> -	vdev->context = c;
> +	c->vdev = &vdev_storage;
> +	c->vdev->context = c;
>  	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> -		vdev->vq[i] = (struct vu_virtq){
> +		c->vdev->vq[i] = (struct vu_virtq){
>  			.call_fd = -1,
>  			.kick_fd = -1,
>  			.err_fd = -1,
> @@ -849,7 +859,6 @@ void vu_init(struct ctx *c, struct vu_dev *vdev)
>   * vu_cleanup() - Reset vhost-user device
>   * @vdev:	vhost-user device
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_cleanup(struct vu_dev *vdev)
>  {
>  	unsigned int i;
> @@ -896,8 +905,7 @@ void vu_cleanup(struct vu_dev *vdev)
>   */
>  static void vu_sock_reset(struct vu_dev *vdev)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> +	tap_sock_reset(vdev->context);
>  }
>  
>  static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
> @@ -925,7 +933,6 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
>   * @fd:		vhost-user message socket
>   * @events:	epoll events
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
>  {
>  	struct vhost_user_msg msg = { 0 };
> diff --git a/vhost_user.h b/vhost_user.h
> index 5af349ba58b8..464ba21e962f 100644
> --- a/vhost_user.h
> +++ b/vhost_user.h
> @@ -183,7 +183,6 @@ struct vhost_user_msg {
>   *
>   * Return: true if the virqueue is enabled, false otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  static inline bool vu_queue_enabled(const struct vu_virtq *vq)
>  {
>  	return vq->enable;
> @@ -195,14 +194,13 @@ static inline bool vu_queue_enabled(const struct vu_virtq *vq)
>   *
>   * Return: true if the virqueue is started, false otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  static inline bool vu_queue_started(const struct vu_virtq *vq)
>  {
>  	return vq->started;
>  }
>  
>  void vu_print_capabilities(void);
> -void vu_init(struct ctx *c, struct vu_dev *vdev);
> +void vu_init(struct ctx *c);
>  void vu_cleanup(struct vu_dev *vdev);
>  void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events);
>  #endif /* VHOST_USER_H */
> diff --git a/virtio.c b/virtio.c
> index 380590afbca3..0598ff479858 100644
> --- a/virtio.c
> +++ b/virtio.c
> @@ -328,7 +328,6 @@ static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
>   * @dev:	Vhost-user device
>   * @vq:		Virtqueue
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
>  {
>  	if (!vq->vring.avail)
> @@ -504,7 +503,6 @@ static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned i
>   *
>   * Return: -1 if there is an error, 0 otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
>  {
>  	unsigned int head;
> @@ -565,7 +563,6 @@ void vu_queue_unpop(struct vu_virtq *vq)
>   * @vq:		Virtqueue
>   * @num:	Number of element to unpop
>   */
> -/* cppcheck-suppress unusedFunction */
>  bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
>  {
>  	if (num > vq->inuse)
> @@ -621,7 +618,6 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
>   * @len:	Size of the element
>   * @idx:	Used ring entry index
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
>  		   unsigned int len, unsigned int idx)
>  {
> @@ -645,7 +641,6 @@ static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
>   * @vq:		Virtqueue
>   * @count:	Number of entry to flush
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
>  {
>  	uint16_t old, new;
> diff --git a/vu_common.c b/vu_common.c
> new file mode 100644
> index 000000000000..4977d6af0f92
> --- /dev/null
> +++ b/vu_common.c
> @@ -0,0 +1,385 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * common_vu.c - vhost-user common UDP and TCP functions
> + */
> +
> +#include <unistd.h>
> +#include <sys/uio.h>
> +#include <sys/eventfd.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "tap.h"
> +#include "vhost_user.h"
> +#include "pcap.h"
> +#include "vu_common.h"
> +
> +/**
> + * vu_packet_check_range() - Check if a given memory zone is contained in
> + * 			     a mapped guest memory region
> + * @buf:	Array of the available memory regions
> + * @offset:	Offset of data range in packet descriptor
> + * @size:	Length of desired data range
> + * @start:	Start of the packet descriptor
> + *
> + * Return: 0 if the zone is in a mapped memory region, -1 otherwise
> + */
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start)
> +{
> +	struct vu_dev_region *dev_region;
> +
> +	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
> +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +		char *m = (char *)dev_region->mmap_addr;
> +
> +		if (m <= start &&
> +		    start + offset + len <= m + dev_region->mmap_offset +
> +					       dev_region->size)
> +			return 0;
> +	}
> +
> +	return -1;
> +}
> +
> +/**
> + * vu_init_elem() - initialize an array of virtqueue element with 1 iov in each
> + * @elem:	Array of virtqueue element to initialize
> + * @iov:	Array of iovec to assign to virtqueue element
> + * @elem_cnt:	Number of virtqueue element
> + */
> +void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov, int elem_cnt)
> +{
> +	int i;
> +
> +	for (i = 0; i < elem_cnt; i++) {
> +		elem[i].out_num = 0;
> +		elem[i].out_sg = NULL;
> +		elem[i].in_num = 1;
> +		elem[i].in_sg = &iov[i];
> +	}
> +}
> +
> +/**
> + * vu_collect_one_frame() - collect virtio buffers from a given virtqueue for
> + *			    one frame
> + * @vdev:		vhost-user device
> + * @vq:			virtqueue to collect from
> + * @elem:		Array of virtqueue element
> + * 			each element must be initialized with one iovec entry
> + * 			in the in_sg array.
> + * @max_elem:		Number of virtqueue element in the array
> + * @size:		Maximum size of the data in the frame
> + * @hdrlen:		Size of the frame header

No comment for @frame_size.

> + */
> +int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq,
> +			 struct vu_virtq_element *elem, int max_elem,
> +			 size_t size, size_t hdrlen, size_t *frame_size)
> +{
> +	size_t current_size = 0;
> +	struct iovec *iov;
> +	int elem_cnt = 0;
> +	int ret;
> +
> +	/* header is at least virtio_net_hdr_mrg_rxbuf */
> +	hdrlen += sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +
> +	/* collect first (or unique) element, it will contain header */
> +	ret = vu_queue_pop(vdev, vq, &elem[0]);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (elem[0].in_num < 1) {
> +		warn("virtio-net receive queue contains no in buffers");

If this is always called for the receive queue, do you actually need
the vq parameter?

> +		vu_queue_detach_element(vq);
> +		goto out;
> +	}
> +
> +	iov = &elem[elem_cnt].in_sg[0];
> +
> +	ASSERT(iov->iov_len >= hdrlen);

This could occur because of the guest/VMM doing something odd, not a
passt internal bug, yes?  In which case it should not be an ASSERT(),
but a die() (or better yet, a less violent recovery).

> +	/* add space for header */
> +	iov->iov_base = (char *)iov->iov_base + hdrlen;
> +	iov->iov_len -= hdrlen;
> +
> +	if (iov->iov_len > size)
> +		iov->iov_len = size;
> +
> +	elem_cnt++;
> +	current_size = iov->iov_len;
> +
> +	if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		goto out;
> +
> +	/* if possible, coalesce buffers to reach size */
> +	while (current_size < size && elem_cnt < max_elem) {
> +
> +		ret = vu_queue_pop(vdev, vq, &elem[elem_cnt]);
> +		if (ret < 0)
> +			break;
> +
> +		if (elem[elem_cnt].in_num < 1) {
> +			warn("virtio-net receive queue contains no in buffers");
> +			vu_queue_detach_element(vq);
> +			break;
> +		}
> +
> +		iov = &elem[elem_cnt].in_sg[0];
> +
> +		if (iov->iov_len > size - current_size)
> +			iov->iov_len = size - current_size;
> +
> +		current_size += iov->iov_len;
> +		elem_cnt++;
> +	}
> +
> +out:
> +	if (frame_size)
> +		*frame_size = current_size;
> +
> +	return elem_cnt;
> +}
> +
> +/**
> + * vu_collect() - collect virtio buffers from a given virtqueue
> + * @vdev:		vhost-user device
> + * @vq:			virtqueue to collect from
> + * @elem:		Array of virtqueue element
> + * 			each element must be initialized with one iovec entry
> + * 			in the in_sg array.
> + * @max_elem:		Number of virtqueue element in the array
> + * @max_frame_size:	Maximum size of the data in the frame
> + * @hdrlen:		Size of the frame header
> + * @size:		Total size of the buffers we need to collect
> + * 			(if size > max_frame_size, we collect several frame)
> + */
> +int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
> +	       struct vu_virtq_element *elem, int max_elem,
> +	       size_t max_frame_size, size_t hdrlen, size_t size)
> +{
> +	int elem_cnt = 0;
> +
> +	while (size > 0 && elem_cnt < max_elem) {
> +		size_t frame_size;
> +		int cnt;
> +
> +		if (max_frame_size > size)
> +			max_frame_size = size;
> +
> +		cnt = vu_collect_one_frame(vdev, vq,
> +					   &elem[elem_cnt], max_elem - elem_cnt,
> +					   max_frame_size, hdrlen, &frame_size);
> +		if (cnt == 0)
> +			break;
> +
> +		size -= frame_size;
> +		elem_cnt += cnt;
> +
> +		if (frame_size < max_frame_size)
> +			break;

The reason for exiting the loop here is not clear to me.

> +	}
> +
> +	return elem_cnt;
> +}
> +
> +/**
> + * vu_set_vnethdr() - set virtio-net headers in a given iovec
> + * @vdev:		vhost-user device
> + * @iov:		One iovec to initialize
> + * @num_buffers:	Number of guest buffers of the frame
> + * @hdrlen:		Size of the frame header
> + */
> +void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov,
> +		    int num_buffers, size_t hdrlen)
> +{
> +	struct virtio_net_hdr_mrg_rxbuf *vnethdr;
> +
> +	/* header is at least virtio_net_hdr_mrg_rxbuf */
> +	hdrlen += sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +
> +	/* NOLINTNEXTLINE(clang-analyzer-core.UndefinedBinaryOperatorResult) */
> +	iov->iov_base = (char *)iov->iov_base - hdrlen;

Altering the IOV on the fly between the whole frames and just the data
payload part I find kind of hard to track.  If it's possible I think
things would be clearer if we separately kept the "whole frame" and
"just payload" IOVs.  The former could reasonably remain embedded
within the elem structures.  As a side effect I think that would
remove the need for the clang suppression.

> +	iov->iov_len += hdrlen;
> +
> +	vnethdr = iov->iov_base;
> +	vnethdr->hdr = VU_HEADER;
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		vnethdr->num_buffers = htole16(num_buffers);
> +}
> +
> +/**
> + * vu_flush() - flush all the collected buffers to the vhost-user interface
> + * @vdev:	vhost-user device
> + * @vq:		vhost-user virtqueue
> + * @elem:	virtqueue element array to send back to the virqueue
> + * @iov_used:	Length of the array
> + */
> +void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> +	      struct vu_virtq_element *elem, int elem_cnt)
> +{
> +	int i;
> +
> +	for (i = 0; i < elem_cnt; i++)
> +		vu_queue_fill(vq, &elem[i], elem[i].in_sg[0].iov_len, i);
> +
> +	vu_queue_flush(vq, elem_cnt);
> +	vu_queue_notify(vdev, vq);
> +}
> +
> +/**
> + * vu_handle_tx() - Receive data from the TX virtqueue
> + * @vdev:	vhost-user device
> + * @index:	index of the virtqueue
> + * @now:	Current timestamp
> + */
> +static void vu_handle_tx(struct vu_dev *vdev, int index,
> +			 const struct timespec *now)
> +{
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> +	struct vu_virtq *vq = &vdev->vq[index];
> +	int hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	int out_sg_count;
> +	int count;
> +
> +	if (!VHOST_USER_IS_QUEUE_TX(index)) {
> +		debug("vhost-user: index %d is not a TX queue", index);
> +		return;
> +	}
> +
> +	tap_flush_pools();
> +
> +	count = 0;
> +	out_sg_count = 0;
> +	while (count < VIRTQUEUE_MAX_SIZE) {
> +		int ret;
> +
> +		elem[count].out_num = 1;
> +		elem[count].out_sg = &out_sg[out_sg_count];
> +		elem[count].in_num = 0;
> +		elem[count].in_sg = NULL;
> +		ret = vu_queue_pop(vdev, vq, &elem[count]);
> +		if (ret < 0)
> +			break;
> +		out_sg_count += elem[count].out_num;
> +
> +		if (elem[count].out_num < 1) {
> +			debug("virtio-net header not in first element");

This error message doesn't seem right for the check.

> +			break;
> +		}
> +		ASSERT(elem[count].out_num == 1);
> +
> +		tap_add_packet(vdev->context,
> +			       elem[count].out_sg[0].iov_len - hdrlen,
> +			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
> +		count++;
> +	}
> +	tap_handler(vdev->context, now);
> +
> +	if (count) {
> +		int i;
> +
> +		for (i = 0; i < count; i++)
> +			vu_queue_fill(vq, &elem[i], 0, i);
> +		vu_queue_flush(vq, count);
> +		vu_queue_notify(vdev, vq);

For just recycling buffers, doing the flush/notify for every frame
seems even more dubious.

> +	}
> +}
> +
> +/**
> + * vu_kick_cb() - Called on a kick event to start to receive data
> + * @vdev:	vhost-user device
> + * @ref:	epoll reference information
> + * @now:	Current timestamp
> + */
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now)
> +{
> +	eventfd_t kick_data;
> +	ssize_t rc;
> +	int idx;
> +
> +	for (idx = 0; idx < VHOST_USER_MAX_QUEUES; idx++) {
> +		if (vdev->vq[idx].kick_fd == ref.fd)
> +			break;
> +	}

We should be able to put the queue index into the epoll_ref, rather
than doing a linear scan for the right queue.

> +
> +	if (idx == VHOST_USER_MAX_QUEUES)
> +		return;
> +
> +	rc = eventfd_read(ref.fd, &kick_data);
> +	if (rc == -1)
> +		die_perror("vhost-user kick eventfd_read()");
> +
> +	debug("vhost-user: ot kick_data: %016"PRIx64" idx:%d",
> +	      kick_data, idx);
> +	if (VHOST_USER_IS_QUEUE_TX(idx))
> +		vu_handle_tx(vdev, idx, now);
> +}
> +
> +/**
> + * vu_send_single() - Send a buffer to the front-end using the RX virtqueue
> + * @c:		execution context
> + * @buf:	address of the buffer
> + * @size:	size of the buffer
> + *
> + * Return: number of bytes sent, -1 if there is an error
> + */
> +int vu_send_single(const struct ctx *c, const void *buf, size_t size)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +	size_t total;
> +	int elem_cnt, max_elem;
> +	int i;
> +
> +	debug("vu_send_single size %zu", size);
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		err("Got packet, but no available descriptors on RX virtq.");

Error message doesn't seem quite right.

> +		return 0;
> +	}
> +
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		max_elem = VIRTQUEUE_MAX_SIZE;
> +	else
> +		max_elem = 1;
> +
> +	vu_init_elem(elem, in_sg, max_elem);
> +
> +	elem_cnt = vu_collect_one_frame(vdev, vq, elem, max_elem, size,
> +					0, &total);
> +	if (total < size) {
> +		debug("vu_send_single: no space to send the data "
> +		      "elem_cnt %d size %zd", elem_cnt, total);
> +		goto err;
> +	}
> +
> +	vu_set_vnethdr(vdev, in_sg, elem_cnt, 0);
> +
> +	/* copy data from the buffer to the iovec */
> +	iov_from_buf(in_sg, elem_cnt, sizeof(struct virtio_net_hdr_mrg_rxbuf),
> +		     buf, size);
> +
> +	if (*c->pcap) {
> +		pcap_iov(in_sg, elem_cnt,
> +			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +	}
> +
> +	vu_flush(vdev, vq, elem, elem_cnt);
> +
> +	debug("vhost-user sent %zu", total);
> +
> +	return total;
> +err:
> +	for (i = 0; i < elem_cnt; i++)
> +		vu_queue_detach_element(vq);
> +
> +	return 0;
> +}
> diff --git a/vu_common.h b/vu_common.h
> new file mode 100644
> index 000000000000..1d6048060059
> --- /dev/null
> +++ b/vu_common.h
> @@ -0,0 +1,47 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * vhost-user common UDP and TCP functions
> + */
> +
> +#ifndef VU_COMMON_H
> +#define VU_COMMON_H
> +#include <linux/virtio_net.h>
> +
> +static inline void *vu_eth(void *base)
> +{
> +	return ((char *)base + sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +}
> +
> +static inline void *vu_ip(void *base)
> +{
> +	return (struct ethhdr *)vu_eth(base) + 1;
> +}
> +
> +static inline void *vu_payloadv4(void *base)
> +{
> +	return (struct iphdr *)vu_ip(base) + 1;
> +}
> +
> +static inline void *vu_payloadv6(void *base)
> +{
> +	return (struct ipv6hdr *)vu_ip(base) + 1;
> +}
> +
> +void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov,
> +		  int elem_cnt);
> +int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq,
> +			 struct vu_virtq_element *elem, int max_elem,
> +			 size_t size, size_t hdrlen, size_t *frame_size);
> +int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
> +	       struct vu_virtq_element *elem, int max_elem,
> +	       size_t max_frame_size, size_t hdrlen, size_t size);
> +void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov,
> +                    int num_buffers, size_t hdrlen);
> +void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> +	      struct vu_virtq_element *elem, int elem_cnt);
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now);
> +int vu_send_single(const struct ctx *c, const void *buf, size_t size);
> +#endif /* VU_COMMON_H */

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 8/8] test: Add tests for passt in vhost-user mode
  2024-10-10 12:29 ` [PATCH v8 8/8] test: Add tests for passt in vhost-user mode Laurent Vivier
@ 2024-10-15  3:40   ` David Gibson
  2024-10-15 19:54   ` Stefano Brivio
  1 sibling, 0 replies; 50+ messages in thread
From: David Gibson @ 2024-10-15  3:40 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev, Stefano Brivio

[-- Attachment #1: Type: text/plain, Size: 25640 bytes --]

On Thu, Oct 10, 2024 at 02:29:02PM +0200, Laurent Vivier wrote:
> From: Stefano Brivio <sbrivio@redhat.com>
> 
> Run functional and performance tests for vhost-user mode as well. For
> functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
> to their non-vhost-user counterparts, as no differences are intended
> but we want to distinguish them in test logs.
> 
> For performance tests, instead, we add separate perf/passt_vu_tcp and
> perf/passt_vu_udp files, as we need longer test duration, as well as
> higher UDP sending bandwidths and larger TCP windows, to actually get
> the highest throughput vhost-user mode offers.
> 
> For valgrind tests, vhost-user mode needs two extra system calls:
> statx and readlink. Add them as EXTRA_SYSCALLS for the valgrind
> target.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  Makefile               |   3 +-
>  test/lib/perf_report   |  15 +++
>  test/lib/setup         |  77 ++++++++++++---
>  test/lib/setup_ugly    |   2 +-
>  test/passt_vu          |   1 +
>  test/passt_vu_in_ns    |   1 +
>  test/perf/passt_vu_tcp | 211 +++++++++++++++++++++++++++++++++++++++++
>  test/perf/passt_vu_udp | 159 +++++++++++++++++++++++++++++++
>  test/run               |  25 +++++
>  test/two_guests_vu     |   1 +
>  10 files changed, 479 insertions(+), 16 deletions(-)
>  create mode 120000 test/passt_vu
>  create mode 120000 test/passt_vu_in_ns
>  create mode 100644 test/perf/passt_vu_tcp
>  create mode 100644 test/perf/passt_vu_udp
>  create mode 120000 test/two_guests_vu
> 
> diff --git a/Makefile b/Makefile
> index 1e8910dda1f4..ce8aa4302790 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -138,7 +138,8 @@ qrap: $(QRAP_SRCS) passt.h
>  
>  valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction	\
>  			    rt_sigreturn getpid gettid kill clock_gettime mmap \
> -			    mmap2 munmap open unlink gettimeofday futex
> +			    mmap2 munmap open unlink gettimeofday futex statx \
> +			    readlink
>  valgrind: FLAGS += -g -DVALGRIND
>  valgrind: all
>  
> diff --git a/test/lib/perf_report b/test/lib/perf_report
> index d1ef50bfe0d5..c4ec817bcd1e 100755
> --- a/test/lib/perf_report
> +++ b/test/lib/perf_report
> @@ -49,6 +49,21 @@ td:empty { visibility: hidden; }
>  	__passt_tcp_LINE__ __passt_udp_LINE__
>  </table>
>  
> +</li><li><p>passt with vhost-user support</p>
> +<table class="passt" width="70%">
> +	<tr>
> +		<th/>
> +		<th id="perf_passt_vu_tcp" colspan="__passt_vu_tcp_cols__">TCP, __passt_vu_tcp_threads__ at __passt_vu_tcp_freq__ GHz</th>
> +		<th id="perf_passt_vu_udp" colspan="__passt_vu_udp_cols__">UDP, __passt_vu_udp_threads__ at __passt_vu_udp_freq__ GHz</th>
> +	</tr>
> +	<tr>
> +		<td align="right">MTU:</td>
> +		__passt_vu_tcp_header__
> +		__passt_vu_udp_header__
> +	</tr>
> +	__passt_vu_tcp_LINE__ __passt_vu_udp_LINE__
> +</table>
> +
>  <style type="text/CSS">
>  table.pasta_local td { border: 0px solid; padding: 6px; line-height: 1; }
>  table.pasta_local td { text-align: right; }
> diff --git a/test/lib/setup b/test/lib/setup
> index 5338393ce35c..3409bd29cd81 100755
> --- a/test/lib/setup
> +++ b/test/lib/setup
> @@ -15,8 +15,7 @@
>  
>  INITRAMFS="${BASEPATH}/mbuto.img"
>  VCPUS="$( [ $(nproc) -ge 8 ] && echo 6 || echo $(( $(nproc) / 2 + 1 )) )"
> -__mem_kib="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
> -VMEM="$((${__mem_kib} / 1024 / 4))"
> +MEM_KIB="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
>  QEMU_ARCH="$(uname -m)"
>  [ "${QEMU_ARCH}" = "i686" ] && QEMU_ARCH=i386
>  
> @@ -46,6 +45,7 @@ setup_passt() {
>  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt.pcap"
>  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
>  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> +	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
>  
>  	context_run passt "make clean"
>  	context_run passt "make valgrind"
> @@ -54,16 +54,29 @@ setup_passt() {
>  	# pidfile isn't created until passt is listening
>  	wait_for [ -f "${STATESETUP}/passt.pid" ]
>  
> +	__vmem="$((${MEM_KIB} / 1024 / 4))"
> +	if [ ${VHOST_USER} -eq 1 ]; then
> +		__vmem="$(((${__vmem} + 500) / 1000))G"
> +		__qemu_netdev="						       \
> +			-chardev socket,id=c,path=${STATESETUP}/passt.socket   \
> +			-netdev vhost-user,id=v,chardev=c		       \
> +			-device virtio-net,netdev=v			       \
> +			-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
> +			-numa node,memdev=m"
> +	else
> +		__qemu_netdev="-device virtio-net-pci,netdev=s		       \
> +			-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket"
> +	fi
> +
>  	GUEST_CID=94557
>  	context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
>  		' -machine accel=kvm'                                      \
> -		' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
> +		' -m '${__vmem}' -cpu host -smp '${VCPUS}		   \
>  		' -kernel '"${KERNEL}"					   \
>  		' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
>  		' -nodefaults'						   \
>  		' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
> -		' -device virtio-net-pci,netdev=s0 '			   \
> -		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
> +		" ${__qemu_netdev}"					   \
>  		" -pidfile ${STATESETUP}/qemu.pid"			   \
>  		" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
>  
> @@ -142,6 +155,7 @@ setup_passt_in_ns() {
>  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_in_pasta.pcap"
>  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
>  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> +	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
>  
>  	if [ ${VALGRIND} -eq 1 ]; then
>  		context_run passt "make clean"
> @@ -154,17 +168,30 @@ setup_passt_in_ns() {
>  	fi
>  	wait_for [ -f "${STATESETUP}/passt.pid" ]
>  
> +	__vmem="$((${MEM_KIB} / 1024 / 4))"
> +	if [ ${VHOST_USER} -eq 1 ]; then
> +		__vmem="$(((${__vmem} + 500) / 1000))G"
> +		__qemu_netdev="						       \
> +			-chardev socket,id=c,path=${STATESETUP}/passt.socket   \
> +			-netdev vhost-user,id=v,chardev=c		       \
> +			-device virtio-net,netdev=v			       \
> +			-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
> +			-numa node,memdev=m"
> +	else
> +		__qemu_netdev="-device virtio-net-pci,netdev=s		       \
> +			-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket"
> +	fi
> +
>  	GUEST_CID=94557
>  	context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
>  		' -machine accel=kvm'                                      \
>  		' -M accel=kvm:tcg'                                        \
> -		' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
> +		' -m '${__vmem}' -cpu host -smp '${VCPUS}		   \
>  		' -kernel '"${KERNEL}"					   \
>  		' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
>  		' -nodefaults'						   \
>  		' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
> -		' -device virtio-net-pci,netdev=s0 '			   \
> -		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
> +		" ${__qemu_netdev}"					   \
>  		" -pidfile ${STATESETUP}/qemu.pid"			   \
>  		" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
>  
> @@ -214,6 +241,7 @@ setup_two_guests() {
>  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
>  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
>  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> +	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
>  
>  	context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
>  	wait_for [ -f "${STATESETUP}/passt_1.pid" ]
> @@ -222,33 +250,54 @@ setup_two_guests() {
>  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
>  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
>  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> +	[ ${VHOST_USER} -eq 1 ] && __opts="${__opts} --vhost-user"
>  
>  	context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
>  	wait_for [ -f "${STATESETUP}/passt_2.pid" ]
>  
> +	__vmem="$((${MEM_KIB} / 1024 / 4))"
> +	if [ ${VHOST_USER} -eq 1 ]; then
> +		__vmem="$(((${__vmem} + 500) / 1000))G"
> +		__qemu_netdev1="					       \
> +			-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
> +			-netdev vhost-user,id=v,chardev=c		       \
> +			-device virtio-net,netdev=v			       \
> +			-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
> +			-numa node,memdev=m"
> +		__qemu_netdev1="					       \
> +			-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
> +			-netdev vhost-user,id=v,chardev=c		       \
> +			-device virtio-net,netdev=v			       \
> +			-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
> +			-numa node,memdev=m"
> +	else
> +		__qemu_netdev1="-device virtio-net-pci,netdev=s		       \
> +			-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_1.socket"
> +		__qemu_netdev2="-device virtio-net-pci,netdev=s		       \
> +			-netdev stream,id=s,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_2.socket"
> +	fi
> +
>  	GUEST_1_CID=94557
>  	context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}"		     \
>  		' -M accel=kvm:tcg'                                          \
> -		' -m '${VMEM}' -cpu host -smp '${VCPUS}                      \
> +		' -m '${__vmem}' -cpu host -smp '${VCPUS}		     \
>  		' -kernel '"${KERNEL}"					     \
>  		' -initrd '${INITRAMFS}' -nographic -serial stdio'	     \
>  		' -nodefaults'						     \
>  		' -append "console=ttyS0 mitigations=off apparmor=0" '	     \
> -		' -device virtio-net-pci,netdev=s0 '			     \
> -		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_1.socket " \
> +		" ${__qemu_netdev1}"					     \
>  		" -pidfile ${STATESETUP}/qemu_1.pid"			     \
>  		" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"
>  
>  	GUEST_2_CID=94558
>  	context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}"		     \
>  		' -M accel=kvm:tcg'                                          \
> -		' -m '${VMEM}' -cpu host -smp '${VCPUS}                      \
> +		' -m '${__vmem}' -cpu host -smp '${VCPUS}		     \
>  		' -kernel '"${KERNEL}"					     \
>  		' -initrd '${INITRAMFS}' -nographic -serial stdio'	     \
>  		' -nodefaults'						     \
>  		' -append "console=ttyS0 mitigations=off apparmor=0" '	     \
> -		' -device virtio-net-pci,netdev=s0 '			     \
> -		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt_2.socket " \
> +		" ${__qemu_netdev2}"					     \
>  		" -pidfile ${STATESETUP}/qemu_2.pid"			     \
>  		" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID"
>  
> diff --git a/test/lib/setup_ugly b/test/lib/setup_ugly
> index 4b2a0774de1d..2802cc3bb43b 100755
> --- a/test/lib/setup_ugly
> +++ b/test/lib/setup_ugly
> @@ -33,7 +33,7 @@ setup_memory() {
>  
>  	pane_or_context_run guest 'qemu-system-$(uname -m)'		   \
>  		' -machine accel=kvm'                                      \
> -		' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
> +		' -m '$((${MEM_KIB} / 1024 / 4))' -cpu host -smp '${VCPUS}                    \
>  		' -kernel ' "/boot/vmlinuz-$(uname -r)"			   \
>  		' -initrd '${INITRAMFS_MEM}' -nographic -serial stdio'	   \
>  		' -nodefaults'						   \
> diff --git a/test/passt_vu b/test/passt_vu
> new file mode 120000
> index 000000000000..22f1840d1ad6
> --- /dev/null
> +++ b/test/passt_vu
> @@ -0,0 +1 @@
> +passt
> \ No newline at end of file
> diff --git a/test/passt_vu_in_ns b/test/passt_vu_in_ns
> new file mode 120000
> index 000000000000..3ff479e0436b
> --- /dev/null
> +++ b/test/passt_vu_in_ns
> @@ -0,0 +1 @@
> +passt_in_ns
> \ No newline at end of file
> diff --git a/test/perf/passt_vu_tcp b/test/perf/passt_vu_tcp
> new file mode 100644
> index 000000000000..b43400804e64
> --- /dev/null
> +++ b/test/perf/passt_vu_tcp
> @@ -0,0 +1,211 @@
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +#
> +# PASST - Plug A Simple Socket Transport
> +#  for qemu/UNIX domain socket mode
> +#
> +# PASTA - Pack A Subtle Tap Abstraction
> +#  for network namespace/tap device mode
> +#
> +# test/perf/passt_vu_tcp - Check TCP performance in passt vhost-user mode
> +#
> +# Copyright (c) 2021 Red Hat GmbH
> +# Author: Stefano Brivio <sbrivio@redhat.com>
> +
> +gtools	/sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr # From neper
> +nstools	/sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr
> +htools	bc head sed seq
> +
> +set	MAP_NS4 192.0.2.2
> +set	MAP_NS6 2001:db8:9a55::2
> +
> +test	passt: throughput and latency
> +
> +guest	/sbin/sysctl -w net.core.rmem_max=536870912
> +guest	/sbin/sysctl -w net.core.wmem_max=536870912
> +guest	/sbin/sysctl -w net.core.rmem_default=33554432
> +guest	/sbin/sysctl -w net.core.wmem_default=33554432
> +guest	/sbin/sysctl -w net.ipv4.tcp_rmem="4096 131072 268435456"
> +guest	/sbin/sysctl -w net.ipv4.tcp_wmem="4096 131072 268435456"
> +guest	/sbin/sysctl -w net.ipv4.tcp_timestamps=0
> +
> +ns	/sbin/sysctl -w net.ipv4.tcp_rmem="4096 524288 134217728"
> +ns	/sbin/sysctl -w net.ipv4.tcp_wmem="4096 524288 134217728"
> +ns	/sbin/sysctl -w net.ipv4.tcp_timestamps=0
> +
> +gout	IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
> +
> +hout	FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1
> +hout	FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
> +hout	FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
> +
> +set	THREADS 4
> +set	TIME 5
> +set	OMIT 0.1
> +set	OPTS -Z -P __THREADS__ -l 1M -O__OMIT__ -N
> +
> +info	Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
> +report	passt_vu tcp __THREADS__ __FREQ__
> +
> +th	MTU 256B 576B 1280B 1500B 9000B 65520B
> +
> +
> +tr	TCP throughput over IPv6: guest to host
> +iperf3s	ns 10002
> +
> +bw	-
> +bw	-
> +guest	ip link set dev __IFNAME__ mtu 1280
> +iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 16M
> +bw	__BW__ 1.2 1.5
> +guest	ip link set dev __IFNAME__ mtu 1500
> +iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 32M
> +bw	__BW__ 1.6 1.8
> +guest	ip link set dev __IFNAME__ mtu 9000
> +iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M
> +bw	__BW__ 4.0 5.0
> +guest	ip link set dev __IFNAME__ mtu 65520
> +iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M
> +bw	__BW__ 7.0 8.0
> +
> +iperf3k	ns
> +
> +tl	TCP RR latency over IPv6: guest to host
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +nsb	tcp_rr --nolog -6
> +gout	LAT tcp_rr --nolog -l1 -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 200 150
> +
> +tl	TCP CRR latency over IPv6: guest to host
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +nsb	tcp_crr --nolog -6
> +gout	LAT tcp_crr --nolog -l1 -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 500 400
> +
> +tr	TCP throughput over IPv4: guest to host
> +iperf3s	ns 10002
> +
> +guest	ip link set dev __IFNAME__ mtu 256
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 2M
> +bw	__BW__ 0.2 0.3
> +guest	ip link set dev __IFNAME__ mtu 576
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 4M
> +bw	__BW__ 0.5 0.8
> +guest	ip link set dev __IFNAME__ mtu 1280
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 8M
> +bw	__BW__ 1.2 1.5
> +guest	ip link set dev __IFNAME__ mtu 1500
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 16M
> +bw	__BW__ 1.6 1.8
> +guest	ip link set dev __IFNAME__ mtu 9000
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M
> +bw	__BW__ 4.0 5.0
> +guest	ip link set dev __IFNAME__ mtu 65520
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M
> +bw	__BW__ 7.0 8.0
> +
> +iperf3k	ns
> +
> +# Reducing MTU below 1280 deconfigures IPv6, get our address back
> +guest	dhclient -6 -x
> +guest	dhclient -6 __IFNAME__
> +
> +tl	TCP RR latency over IPv4: guest to host
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +nsb	tcp_rr --nolog -4
> +gout	LAT tcp_rr --nolog -l1 -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 200 150
> +
> +tl	TCP CRR latency over IPv4: guest to host
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +nsb	tcp_crr --nolog -4
> +gout	LAT tcp_crr --nolog -l1 -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 500 400
> +
> +tr	TCP throughput over IPv6: host to guest
> +iperf3s	guest 10001
> +
> +bw	-
> +bw	-
> +bw	-
> +bw	-
> +bw	-
> +iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -w 32M
> +bw	__BW__ 6.0 6.8
> +
> +iperf3k	guest
> +
> +tl	TCP RR latency over IPv6: host to guest
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +guestb	tcp_rr --nolog -P 10001 -C 10011 -6
> +sleep	1
> +nsout	LAT tcp_rr --nolog -l1 -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 200 150
> +
> +tl	TCP CRR latency over IPv6: host to guest
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +guestb	tcp_crr --nolog -P 10001 -C 10011 -6
> +sleep	1
> +nsout	LAT tcp_crr --nolog -l1 -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 500 350
> +
> +
> +tr	TCP throughput over IPv4: host to guest
> +iperf3s	guest 10001
> +
> +bw	-
> +bw	-
> +bw	-
> +bw	-
> +bw	-
> +iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -w 32M
> +bw	__BW__ 6.0 6.8
> +
> +iperf3k	guest
> +
> +tl	TCP RR latency over IPv4: host to guest
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +guestb	tcp_rr --nolog -P 10001 -C 10011 -4
> +sleep	1
> +nsout	LAT tcp_rr --nolog -l1 -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 200 150
> +
> +tl	TCP CRR latency over IPv6: host to guest
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +guestb	tcp_crr --nolog -P 10001 -C 10011 -4
> +sleep	1
> +nsout	LAT tcp_crr --nolog -l1 -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 500 300
> +
> +te
> diff --git a/test/perf/passt_vu_udp b/test/perf/passt_vu_udp
> new file mode 100644
> index 000000000000..943ac11b4a51
> --- /dev/null
> +++ b/test/perf/passt_vu_udp
> @@ -0,0 +1,159 @@
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +#
> +# PASST - Plug A Simple Socket Transport
> +#  for qemu/UNIX domain socket mode
> +#
> +# PASTA - Pack A Subtle Tap Abstraction
> +#  for network namespace/tap device mode
> +#
> +# test/perf/passt_vu_udp - Check UDP performance in passt vhost-user mode
> +#
> +# Copyright (c) 2021 Red Hat GmbH
> +# Author: Stefano Brivio <sbrivio@redhat.com>
> +
> +gtools	/sbin/sysctl ip jq nproc sleep iperf3 udp_rr # From neper
> +nstools	ip jq sleep iperf3 udp_rr
> +htools	bc head sed
> +
> +set	MAP_NS4 192.0.2.2
> +set	MAP_NS6 2001:db8:9a55::2
> +
> +test	passt: throughput and latency
> +
> +guest	/sbin/sysctl -w net.core.rmem_max=16777216
> +guest	/sbin/sysctl -w net.core.wmem_max=16777216
> +guest	/sbin/sysctl -w net.core.rmem_default=16777216
> +guest	/sbin/sysctl -w net.core.wmem_default=16777216
> +
> +hout	FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1
> +hout	FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
> +hout	FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
> +
> +set	THREADS 2
> +set	TIME 1
> +set	OPTS -u -P __THREADS__ --pacing-timer 1000
> +
> +info	Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
> +
> +report	passt_vu udp __THREADS__ __FREQ__
> +
> +th	pktlen 256B 576B 1280B 1500B 9000B 65520B
> +
> +tr	UDP throughput over IPv6: guest to host
> +iperf3s	ns 10002
> +# (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header
> +
> +bw	-
> +bw	-
> +iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 3G -l 1232
> +bw	__BW__ 0.8 1.2
> +iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 4G -l 1452
> +bw	__BW__ 1.0 1.5
> +iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 10G -l 8952
> +bw	__BW__ 4.0 5.0
> +iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 20G -l 64372
> +bw	__BW__ 4.0 5.0
> +
> +iperf3k	ns
> +
> +tl	UDP RR latency over IPv6: guest to host
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +nsb	udp_rr --nolog -6
> +gout	LAT udp_rr --nolog -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 200 150
> +
> +
> +tr	UDP throughput over IPv4: guest to host
> +iperf3s	ns 10002
> +# (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header
> +
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 1G -l 228
> +bw	__BW__ 0.0 0.0
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 2G -l 548
> +bw	__BW__ 0.4 0.6
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 3G -l 1252
> +bw	__BW__ 0.8 1.2
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 4G -l 1472
> +bw	__BW__ 1.0 1.5
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 10G -l 8972
> +bw	__BW__ 4.0 5.0
> +iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 20G -l 65492
> +bw	__BW__ 4.0 5.0
> +
> +iperf3k	ns
> +
> +tl	UDP RR latency over IPv4: guest to host
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +nsb	udp_rr --nolog -4
> +gout	LAT udp_rr --nolog -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 200 150
> +
> +
> +tr	UDP throughput over IPv6: host to guest
> +iperf3s	guest 10001
> +# (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header
> +
> +bw	-
> +bw	-
> +iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 3G -l 1232
> +bw	__BW__ 0.8 1.2
> +iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 4G -l 1452
> +bw	__BW__ 1.0 1.5
> +iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 10G -l 8952
> +bw	__BW__ 3.0 4.0
> +iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 20G -l 64372
> +bw	__BW__ 3.0 4.0
> +
> +iperf3k	guest
> +
> +tl	UDP RR latency over IPv6: host to guest
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +guestb	udp_rr --nolog -P 10001 -C 10011 -6
> +sleep	1
> +nsout	LAT udp_rr --nolog -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 200 150
> +
> +
> +tr	UDP throughput over IPv4: host to guest
> +iperf3s	guest 10001
> +# (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header
> +
> +iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 1G -l 228
> +bw	__BW__ 0.0 0.0
> +iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 2G -l 548
> +bw	__BW__ 0.4 0.6
> +iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 3G -l 1252
> +bw	__BW__ 0.8 1.2
> +iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 4G -l 1472
> +bw	__BW__ 1.0 1.5
> +iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 10G -l 8972
> +bw	__BW__ 3.0 4.0
> +iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 20G -l 65492
> +bw	__BW__ 3.0 4.0
> +
> +iperf3k	guest
> +
> +tl	UDP RR latency over IPv4: host to guest
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +lat	-
> +guestb	udp_rr --nolog -P 10001 -C 10011 -4
> +sleep	1
> +nsout	LAT udp_rr --nolog -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
> +lat	__LAT__ 200 150
> +
> +te
> diff --git a/test/run b/test/run
> index 547a729b3fbe..f188d8eaf2e0 100755
> --- a/test/run
> +++ b/test/run
> @@ -93,6 +93,7 @@ run() {
>  	test memory/passt
>  	teardown memory
>  
> +	VHOST_USER=0
>  	setup passt
>  	test passt/ndp
>  	test passt/dhcp
> @@ -115,7 +116,22 @@ run() {
>  	test two_guests/basic
>  	teardown two_guests
>  
> +	VHOST_USER=1
> +	setup passt_in_ns
> +	test passt_vu/ndp
> +	test passt_vu_in_ns/dhcp
> +	test passt_vu_in_ns/icmp
> +	test passt_vu_in_ns/tcp
> +	test passt_vu_in_ns/udp
> +	test passt_vu_in_ns/shutdown
> +	teardown passt_in_ns
> +
> +	setup two_guests
> +	test two_guests_vu/basic
> +	teardown two_guests
> +
>  	VALGRIND=0
> +	VHOST_USER=0
>  	setup passt_in_ns
>  	test passt/ndp
>  	test passt_in_ns/dhcp
> @@ -126,6 +142,15 @@ run() {
>  	test passt_in_ns/shutdown
>  	teardown passt_in_ns
>  
> +	VHOST_USER=1
> +	setup passt_in_ns
> +	test passt_vu/ndp
> +	test passt_vu_in_ns/dhcp
> +	test perf/passt_vu_tcp
> +	test perf/passt_vu_udp
> +	test passt_vu_in_ns/shutdown
> +	teardown passt_in_ns
> +
>  	# TODO: Make those faster by at least pre-installing gcc and make on
>  	# non-x86 images, then re-enable.
>  skip_distro() {
> diff --git a/test/two_guests_vu b/test/two_guests_vu
> new file mode 120000
> index 000000000000..144b7cac5438
> --- /dev/null
> +++ b/test/two_guests_vu
> @@ -0,0 +1 @@
> +test/two_guests
> \ No newline at end of file

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 8/8] test: Add tests for passt in vhost-user mode
  2024-10-10 12:29 ` [PATCH v8 8/8] test: Add tests for passt in vhost-user mode Laurent Vivier
  2024-10-15  3:40   ` David Gibson
@ 2024-10-15 19:54   ` Stefano Brivio
  2024-10-16  8:06     ` Laurent Vivier
  1 sibling, 1 reply; 50+ messages in thread
From: Stefano Brivio @ 2024-10-15 19:54 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 10 Oct 2024 14:29:02 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> From: Stefano Brivio <sbrivio@redhat.com>
> 
> Run functional and performance tests for vhost-user mode as well. For
> functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
> to their non-vhost-user counterparts, as no differences are intended
> but we want to distinguish them in test logs.
> 
> [...]
>
> diff --git a/test/two_guests_vu b/test/two_guests_vu
> new file mode 120000
> index 000000000000..144b7cac5438
> --- /dev/null
> +++ b/test/two_guests_vu
> @@ -0,0 +1 @@
> +test/two_guests
> \ No newline at end of file

Oops, this link is wrong: it works if you execute the tests from the
top git directory, but not if you run them from test/. It should simply
point to 'two_guests', not to 'test/two_guests'.

For some reason, even with that fixed, v8 fails for me on that test:
'dhclient -6 eth0' gets stuck in guest #2 (it works in guest #1). I
still need to debug that. I'm fairly sure that v6 or v7 worked.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-10 12:29 ` [PATCH v8 7/8] vhost-user: add vhost-user Laurent Vivier
  2024-10-15  3:23   ` David Gibson
@ 2024-10-15 19:54   ` Stefano Brivio
  2024-10-16  0:41     ` David Gibson
  2024-10-17  0:10   ` Stefano Brivio
  2 siblings, 1 reply; 50+ messages in thread
From: Stefano Brivio @ 2024-10-15 19:54 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[Still partial review]

On Thu, 10 Oct 2024 14:29:01 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> add virtio and vhost-user functions to connect with QEMU.
> 
>   $ ./passt --vhost-user
> 
> and
> 
>   # qemu-system-x86_64 ... -m 4G \
>         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>         -numa node,memdev=memfd0 \
>         -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>         -netdev vhost-user,id=netdev0,chardev=chr0 \
>         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>         ...
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile     |   6 +-
>  conf.c       |  21 ++-
>  epoll_type.h |   4 +
>  iov.c        |   1 -
>  isolation.c  |  15 +-
>  packet.c     |  11 ++
>  packet.h     |   8 +-
>  passt.1      |  10 +-
>  passt.c      |   9 +
>  passt.h      |   6 +
>  pcap.c       |   1 -
>  tap.c        |  80 +++++++--
>  tap.h        |   5 +-
>  tcp.c        |   7 +
>  tcp_vu.c     | 476 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_vu.h     |  12 ++
>  udp.c        |  10 ++
>  udp_vu.c     | 336 ++++++++++++++++++++++++++++++++++++
>  udp_vu.h     |  13 ++
>  vhost_user.c |  37 ++--
>  vhost_user.h |   4 +-
>  virtio.c     |   5 -
>  vu_common.c  | 385 +++++++++++++++++++++++++++++++++++++++++
>  vu_common.h  |  47 +++++
>  24 files changed, 1454 insertions(+), 55 deletions(-)
>  create mode 100644 tcp_vu.c
>  create mode 100644 tcp_vu.h
>  create mode 100644 udp_vu.c
>  create mode 100644 udp_vu.h
>  create mode 100644 vu_common.c
>  create mode 100644 vu_common.h
> 
> diff --git a/Makefile b/Makefile
> index 0e8ed60a0da1..1e8910dda1f4 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -54,7 +54,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
> +	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> +	vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -64,7 +65,8 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h vhost_user.h virtio.h
> +	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> +	virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/conf.c b/conf.c
> index c63101970155..29d6e41f5770 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -45,6 +45,7 @@
>  #include "lineread.h"
>  #include "isolation.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /**
>   * next_chunk - Return the next piece of a string delimited by a character
> @@ -762,9 +763,14 @@ static void usage(const char *name, FILE *f, int status)
>  			"    default: same interface name as external one\n");
>  	} else {
>  		fprintf(f,
> -			"  -s, --socket PATH	UNIX domain socket path\n"
> +			"  -s, --socket, --socket-path PATH	UNIX domain socket path\n"
>  			"    default: probe free path starting from "
>  			UNIX_SOCK_PATH "\n", 1);
> +		fprintf(f,
> +			"  --vhost-user		Enable vhost-user mode\n"
> +			"    UNIX domain socket is provided by -s option\n"
> +			"  --print-capabilities	print back-end capabilities in JSON format,\n"
> +			"    only meaningful for vhost-user mode\n");
>  	}
>  
>  	fprintf(f,
> @@ -1290,6 +1296,10 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"map-host-loopback", required_argument, NULL,		21 },
>  		{"map-guest-addr", required_argument,	NULL,		22 },
>  		{"dns-host",	required_argument,	NULL,		24 },
> +		{"vhost-user",	no_argument,		NULL,		25 },
> +		/* vhost-user backend program convention */
> +		{"print-capabilities", no_argument,	NULL,		26 },
> +		{"socket-path",	required_argument,	NULL,		's' },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1478,6 +1488,15 @@ void conf(struct ctx *c, int argc, char **argv)
>  				break;
>  
>  			die("Invalid host nameserver address: %s", optarg);
> +		case 25:
> +			if (c->mode == MODE_PASTA) {
> +				err("--vhost-user is for passt mode only");
> +				usage(argv[0], stdout, EXIT_SUCCESS);

This shouldn't exit with EXIT_SUCCESS: it's not supported for pasta so
it's an error (see all the other cases like this one):

				die("--vhost-user is for passt mode only");

> +			}
> +			c->mode = MODE_VU;
> +			break;
> +		case 26:
> +			vu_print_capabilities();
>  			break;
>  		case 'd':
>  			c->debug = 1;
> diff --git a/epoll_type.h b/epoll_type.h
> index 0ad1efa0ccec..f3ef41584757 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -36,6 +36,10 @@ enum epoll_type {
>  	EPOLL_TYPE_TAP_PASST,
>  	/* socket listening for qemu socket connections */
>  	EPOLL_TYPE_TAP_LISTEN,
> +	/* vhost-user command socket */
> +	EPOLL_TYPE_VHOST_CMD,
> +	/* vhost-user kick event socket */
> +	EPOLL_TYPE_VHOST_KICK,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/iov.c b/iov.c
> index 3f9e229a305f..3741db21790f 100644
> --- a/iov.c
> +++ b/iov.c
> @@ -68,7 +68,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
>   *
>   * Returns:    The number of bytes successfully copied.
>   */
> -/* cppcheck-suppress unusedFunction */
>  size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
>  		    size_t offset, const void *buf, size_t bytes)
>  {
> diff --git a/isolation.c b/isolation.c
> index 45fba1e68b9d..c2a3c7b7911d 100644
> --- a/isolation.c
> +++ b/isolation.c
> @@ -379,12 +379,19 @@ void isolate_postfork(const struct ctx *c)
>  
>  	prctl(PR_SET_DUMPABLE, 0);
>  
> -	if (c->mode == MODE_PASTA) {
> -		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> -		prog.filter = filter_pasta;
> -	} else {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
>  		prog.filter = filter_passt;
> +		break;
> +	case MODE_PASTA:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> +		prog.filter = filter_pasta;
> +		break;
> +	case MODE_VU:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
> +		prog.filter = filter_vu;
> +		break;
>  	}
>  
>  	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
> diff --git a/packet.c b/packet.c
> index 37489961a37e..e5a78d079231 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -36,6 +36,17 @@
>  static int packet_check_range(const struct pool *p, size_t offset, size_t len,
>  			      const char *start, const char *func, int line)
>  {
> +	if (p->buf_size == 0) {
> +		int ret;
> +
> +		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
> +
> +		if (ret == -1)
> +			trace("cannot find region, %s:%i", func, line);
> +
> +		return ret;
> +	}
> +
>  	if (start < p->buf) {
>  		trace("packet start %p before buffer start %p, "
>  		      "%s:%i", (void *)start, (void *)p->buf, func, line);
> diff --git a/packet.h b/packet.h
> index 8377dcf678bb..3f70e949c066 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -8,8 +8,10 @@
>  
>  /**
>   * struct pool - Generic pool of packets stored in a buffer
> - * @buf:	Buffer storing packet descriptors
> - * @buf_size:	Total size of buffer
> + * @buf:	Buffer storing packet descriptors,
> + * 		a struct vu_dev_region array for passt vhost-user mode
> + * @buf_size:	Total size of buffer,
> + * 		0 for passt vhost-user mode
>   * @size:	Number of usable descriptors for the pool
>   * @count:	Number of used descriptors for the pool
>   * @pkt:	Descriptors: see macros below
> @@ -22,6 +24,8 @@ struct pool {
>  	struct iovec pkt[1];
>  };
>  
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start);
>  void packet_add_do(struct pool *p, size_t len, const char *start,
>  		   const char *func, int line);
>  void *packet_get_do(const struct pool *p, const size_t idx,
> diff --git a/passt.1 b/passt.1
> index ef33267e9cd7..96532dd39aa2 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -397,12 +397,20 @@ interface address are configured on a given host interface.
>  .SS \fBpasst\fR-only options
>  
>  .TP
> -.BR \-s ", " \-\-socket " " \fIpath
> +.BR \-s ", " \-\-socket-path ", " \-\-socket " " \fIpath
>  Path for UNIX domain socket used by \fBqemu\fR(1) or \fBqrap\fR(1) to connect to
>  \fBpasst\fR.
>  Default is to probe a free socket, not accepting connections, starting from
>  \fI/tmp/passt_1.socket\fR to \fI/tmp/passt_64.socket\fR.
>  
> +.TP
> +.BR \-\-vhost-user
> +Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
> +
> +.TP
> +.BR \-\-print-capabilities
> +Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
> +
>  .TP
>  .BR \-F ", " \-\-fd " " \fIFD
>  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> diff --git a/passt.c b/passt.c
> index 79093ee02d62..2d105e81218d 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -52,6 +52,7 @@
>  #include "arch.h"
>  #include "log.h"
>  #include "tcp_splice.h"
> +#include "vu_common.h"
>  
>  #define EPOLL_EVENTS		8
>  
> @@ -74,6 +75,8 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_TAP_PASTA]		= "/dev/net/tun device",
>  	[EPOLL_TYPE_TAP_PASST]		= "connected qemu socket",
>  	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
> +	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
> +	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -360,6 +363,12 @@ loop:
>  		case EPOLL_TYPE_PING:
>  			icmp_sock_handler(&c, ref);
>  			break;
> +		case EPOLL_TYPE_VHOST_CMD:
> +			vu_control_handler(c.vdev, c.fd_tap, eventmask);
> +			break;
> +		case EPOLL_TYPE_VHOST_KICK:
> +			vu_kick_cb(c.vdev, ref, &now);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 4908ed937dc8..311482d36257 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -25,6 +25,8 @@ union epoll_ref;
>  #include "fwd.h"
>  #include "tcp.h"
>  #include "udp.h"
> +#include "udp_vu.h"
> +#include "vhost_user.h"
>  
>  /* Default address for our end on the tap interface.  Bit 0 of byte 0 must be 0
>   * (unicast) and bit 1 of byte 1 must be 1 (locally administered).  Otherwise
> @@ -94,6 +96,7 @@ struct fqdn {
>  enum passt_modes {
>  	MODE_PASST,
>  	MODE_PASTA,
> +	MODE_VU,
>  };
>  
>  /**
> @@ -228,6 +231,7 @@ struct ip6_ctx {
>   * @freebind:		Allow binding of non-local addresses for forwarding
>   * @low_wmem:		Low probed net.core.wmem_max
>   * @low_rmem:		Low probed net.core.rmem_max
> + * @vdev:		vhost-user device
>   */
>  struct ctx {
>  	enum passt_modes mode;
> @@ -289,6 +293,8 @@ struct ctx {
>  
>  	int low_wmem;
>  	int low_rmem;
> +
> +	struct vu_dev *vdev;
>  };
>  
>  void proto_update_l2_buf(const unsigned char *eth_d,
> diff --git a/pcap.c b/pcap.c
> index 6ee6cdfd261a..718d6ad61732 100644
> --- a/pcap.c
> +++ b/pcap.c
> @@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
>   * @iovcnt:	Number of buffers (@iov entries)
>   * @offset:	Offset of the L2 frame within the full data length
>   */
> -/* cppcheck-suppress unusedFunction */
>  void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
>  {
>  	struct timespec now;
> diff --git a/tap.c b/tap.c
> index 4b826fdf7adc..22d19f1833f7 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -58,6 +58,8 @@
>  #include "packet.h"
>  #include "tap.h"
>  #include "log.h"
> +#include "vhost_user.h"
> +#include "vu_common.h"
>  
>  /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
>  static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
> @@ -78,16 +80,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
>  	struct iovec iov[2];
>  	size_t iovcnt = 0;
>  
> -	if (c->mode == MODE_PASST) {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
>  		iovcnt++;
> -	}
> -
> -	iov[iovcnt].iov_base = (void *)data;
> -	iov[iovcnt].iov_len = l2len;
> -	iovcnt++;
> +		/* fall through */
> +	case MODE_PASTA:
> +		iov[iovcnt].iov_base = (void *)data;
> +		iov[iovcnt].iov_len = l2len;
> +		iovcnt++;
>  
> -	tap_send_frames(c, iov, iovcnt, 1);
> +		tap_send_frames(c, iov, iovcnt, 1);
> +		break;
> +	case MODE_VU:
> +		vu_send_single(c, data, l2len);
> +		break;
> +	}
>  }
>  
>  /**
> @@ -414,10 +422,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  	if (!nframes)
>  		return 0;
>  
> -	if (c->mode == MODE_PASTA)
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
> -	else
> +		break;
> +	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
> +		break;
> +	case MODE_VU:
> +		/* fall through */
> +	default:
> +		ASSERT(0);
> +	}
>  
>  	if (m < nframes)
>  		debug("tap: failed to send %zu frames of %zu",
> @@ -976,7 +992,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
>   * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>   * @c:		Execution context
>   */
> -static void tap_sock_reset(struct ctx *c)
> +void tap_sock_reset(struct ctx *c)
>  {
>  	info("Client connection closed%s", c->one_off ? ", exiting" : "");
>  
> @@ -987,6 +1003,8 @@ static void tap_sock_reset(struct ctx *c)
>  	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
>  	close(c->fd_tap);
>  	c->fd_tap = -1;
> +	if (c->mode == MODE_VU)
> +		vu_cleanup(c->vdev);
>  }
>  
>  /**
> @@ -1205,6 +1223,11 @@ static void tap_backend_show_hints(struct ctx *c)
>  		info("or qrap, for earlier qemu versions:");
>  		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
>  		break;
> +	case MODE_VU:
> +		info("You can start qemu with:");
> +		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
> +		     c->sock_path);
> +		break;
>  	}
>  }
>  
> @@ -1232,8 +1255,8 @@ static void tap_sock_unix_init(const struct ctx *c)
>   */
>  void tap_listen_handler(struct ctx *c, uint32_t events)
>  {
> -	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
>  	struct epoll_event ev = { 0 };
> +	union epoll_ref ref = { 0 };
>  	int v = INT_MAX / 2;
>  	struct ucred ucred;
>  	socklen_t len;
> @@ -1273,6 +1296,10 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  		trace("tap: failed to set SO_SNDBUF to %i", v);
>  
>  	ref.fd = c->fd_tap;
> +	if (c->mode == MODE_VU)
> +		ref.type = EPOLL_TYPE_VHOST_CMD;
> +	else
> +		ref.type = EPOLL_TYPE_TAP_PASST;
>  	ev.events = EPOLLIN | EPOLLRDHUP;
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> @@ -1339,7 +1366,7 @@ static void tap_sock_tun_init(struct ctx *c)
>   * @base:	Buffer base
>   * @size	Buffer size
>   */
> -static void tap_sock_update_pool(void *base, size_t size)
> +void tap_sock_update_pool(void *base, size_t size)
>  {
>  	int i;
>  
> @@ -1353,13 +1380,15 @@ static void tap_sock_update_pool(void *base, size_t size)
>  }
>  
>  /**
> - * tap_backend_init() - Create and set up AF_UNIX socket or
> - *			tuntap file descriptor
> + * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
>   * @c:		Execution context
>   */
>  void tap_backend_init(struct ctx *c)
>  {
> -	tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
> +	if (c->mode == MODE_VU)
> +		tap_sock_update_pool(NULL, 0);
> +	else
> +		tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
>  
>  	if (c->fd_tap != -1) { /* Passed as --fd */
>  		struct epoll_event ev = { 0 };
> @@ -1367,10 +1396,17 @@ void tap_backend_init(struct ctx *c)
>  
>  		ASSERT(c->one_off);
>  		ref.fd = c->fd_tap;
> -		if (c->mode == MODE_PASST)
> +		switch (c->mode) {
> +		case MODE_PASST:
>  			ref.type = EPOLL_TYPE_TAP_PASST;
> -		else
> +			break;
> +		case MODE_PASTA:
>  			ref.type = EPOLL_TYPE_TAP_PASTA;
> +			break;
> +		case MODE_VU:
> +			ref.type = EPOLL_TYPE_VHOST_CMD;
> +			break;
> +		}
>  
>  		ev.events = EPOLLIN | EPOLLRDHUP;
>  		ev.data.u64 = ref.u64;
> @@ -1378,9 +1414,14 @@ void tap_backend_init(struct ctx *c)
>  		return;
>  	}
>  
> -	if (c->mode == MODE_PASTA) {
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		tap_sock_tun_init(c);
> -	} else {
> +		break;
> +	case MODE_VU:
> +		vu_init(c);
> +		/* fall through */
> +	case MODE_PASST:
>  		tap_sock_unix_init(c);
>  
>  		/* In passt mode, we don't know the guest's MAC address until it
> @@ -1388,6 +1429,7 @@ void tap_backend_init(struct ctx *c)
>  		 * first packets will reach it.
>  		 */
>  		memset(&c->guest_mac, 0xff, sizeof(c->guest_mac));
> +		break;

Unrelated change (or left-over).

>  	}
>  
>  	tap_backend_show_hints(c);
> diff --git a/tap.h b/tap.h
> index 8728cc5c09c3..dfbd8b9ebd72 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
>   */
>  static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
>  {
> -	thdr->vnet_len = htonl(l2len);
> +	if (thdr)
> +		thdr->vnet_len = htonl(l2len);
>  }
>  
>  void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
> @@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  int tap_sock_unix_open(char *sock_path);
> +void tap_sock_reset(struct ctx *c);
> +void tap_sock_update_pool(void *base, size_t size);
>  void tap_backend_init(struct ctx *c);
>  void tap_flush_pools(void);
>  void tap_handler(struct ctx *c, const struct timespec *now);
> diff --git a/tcp.c b/tcp.c
> index eae02b1647e3..fd2def0d8a39 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -304,6 +304,7 @@
>  #include "flow_table.h"
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
> +#include "tcp_vu.h"
>  
>  /* MSS rounding: see SET_MSS() */
>  #define MSS_DEFAULT			536
> @@ -1328,6 +1329,9 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
>  static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
>  			 int flags)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_send_flag(c, conn, flags);
> +
>  	return tcp_buf_send_flag(c, conn, flags);
>  }
>  
> @@ -1721,6 +1725,9 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>   */
>  static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_data_from_sock(c, conn);
> +
>  	return tcp_buf_data_from_sock(c, conn);
>  }
>  
> diff --git a/tcp_vu.c b/tcp_vu.c
> new file mode 100644
> index 000000000000..1126fb39d138
> --- /dev/null
> +++ b/tcp_vu.c
> @@ -0,0 +1,476 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* tcp_vu.c - TCP L2 vhost-user management functions
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#include <errno.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +#include <netinet/ip.h>
> +
> +#include <sys/socket.h>
> +
> +#include <linux/tcp.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "vhost_user.h"
> +#include "tcp.h"
> +#include "pcap.h"
> +#include "flow.h"
> +#include "tcp_conn.h"
> +#include "flow_table.h"
> +#include "tcp_vu.h"
> +#include "tap.h"
> +#include "tcp_internal.h"
> +#include "checksum.h"
> +#include "vu_common.h"
> +
> +static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1];
> +static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +
> +/**
> + * tcp_vu_l2_hdrlen() - return the size of the header in level 2 frame (TDP)
> + * @v6:		Set for IPv6 packet
> + *
> + * Return: Return the size of the header
> + */
> +static size_t tcp_vu_l2_hdrlen(bool v6)
> +{
> +	size_t l2_hdrlen;
> +
> +	l2_hdrlen = sizeof(struct ethhdr) + sizeof(struct tcphdr);
> +
> +	if (v6)
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	else
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +	return l2_hdrlen;
> +}
> +
> +/**
> + * tcp_vu_update_check() - Calculate TCP checksum
> + * @tapside:	Address information for one side of the flow
> + * @iov:	Pointer to the array of IO vectors
> + * @iov_used:	Length of the array
> + */
> +static void tcp_vu_update_check(const struct flowside *tapside,
> +			        struct iovec *iov, int iov_used)
> +{
> +	char *base = iov[0].iov_base;
> +
> +	if (inany_v4(&tapside->oaddr)) {
> +		const struct iphdr *iph = vu_ip(base);
> +
> +		tcp_update_check_tcp4(iph, iov, iov_used,
> +				      (char *)vu_payloadv4(base) - base);
> +	} else {
> +		const struct ipv6hdr *ip6h = vu_ip(base);
> +
> +		tcp_update_check_tcp6(ip6h, iov, iov_used,
> +				      (char *)vu_payloadv6(base) - base);
> +	}
> +}
> +
> +/**
> + * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @flags:	TCP flags: if not set, send segment only if ACK is due
> + *
> + * Return: negative error code on connection reset, 0 otherwise
> + */
> +int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	size_t l2len, l4len, optlen, hdrlen;
> +	struct ethhdr *eh;
> +	int elem_cnt;
> +	int nb_ack;
> +	int ret;
> +
> +	hdrlen = tcp_vu_l2_hdrlen(CONN_V6(conn));
> +
> +	vu_init_elem(elem, iov_vu, 2);
> +
> +	elem_cnt = vu_collect_one_frame(vdev, vq, elem, 1,
> +					hdrlen + OPT_MSS_LEN + OPT_WS_LEN + 1,
> +					0, NULL);
> +	if (elem_cnt < 1)
> +		return 0;
> +
> +	vu_set_vnethdr(vdev, &iov_vu[0], 1, 0);
> +
> +	eh = vu_eth(iov_vu[0].iov_base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	if (CONN_V4(conn)) {
> +		struct tcp_payload_t *payload;
> +		struct iphdr *iph;
> +		uint32_t seq;
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		iph = vu_ip(iov_vu[0].iov_base);
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv4(iov_vu[0].iov_base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_flags_t, opts) / 4;
> +		payload->th.ack = 1;
> +
> +		seq = conn->seq_to_tap;
> +		ret = tcp_prepare_flags(c, conn, flags, &payload->th,
> +					(char *)payload->data, &optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vq, 1);
> +			return ret;
> +		}
> +
> +		l4len = tcp_fill_headers4(conn, NULL, iph, payload, optlen,
> +					  NULL, seq, true);
> +		l2len = sizeof(*iph);
> +	} else {
> +		struct tcp_payload_t *payload;
> +		struct ipv6hdr *ip6h;
> +		uint32_t seq;
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		ip6h = vu_ip(iov_vu[0].iov_base);
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv6(iov_vu[0].iov_base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_flags_t, opts) / 4;
> +		payload->th.ack = 1;
> +
> +		seq = conn->seq_to_tap;
> +		ret = tcp_prepare_flags(c, conn, flags, &payload->th,
> +					(char *)payload->data, &optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vq, 1);
> +			return ret;
> +		}
> +
> +		l4len = tcp_fill_headers6(conn, NULL, ip6h, payload, optlen,
> +					  seq, true);
> +		l2len = sizeof(*ip6h);
> +	}
> +	l2len += l4len + sizeof(struct ethhdr);
> +
> +	elem[0].in_sg[0].iov_len = l2len +
> +				   sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	if (*c->pcap) {
> +		tcp_vu_update_check(tapside, &elem[0].in_sg[0], 1);
> +		pcap_iov(&elem[0].in_sg[0], 1,
> +			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +	}
> +	nb_ack = 1;
> +
> +	if (flags & DUP_ACK) {
> +		elem_cnt = vu_collect_one_frame(vdev, vq, &elem[1], 1, l2len,
> +						0, NULL);
> +		if (elem_cnt == 1) {
> +			memcpy(elem[1].in_sg[0].iov_base,
> +			       elem[0].in_sg[0].iov_base, l2len);
> +			vu_set_vnethdr(vdev, &elem[1].in_sg[0], 1, 0);
> +			nb_ack++;
> +
> +			if (*c->pcap)
> +				pcap_iov(&elem[1].in_sg[0], 1, 0);
> +		}
> +	}
> +
> +	vu_flush(vdev, vq, elem, nb_ack);
> +
> +	return 0;
> +}
> +
> +/** tcp_vu_sock_recv() - Receive datastream from socket into vhost-user buffers
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @v4:		Set for IPv4 connections
> + * @fillsize:	Number of bytes we can receive
> + * @datalen:	Size of received data (output)
> + *
> + * Return: Number of iov entries used to store the data

, or negative error code

> + */
> +static ssize_t tcp_vu_sock_recv(const struct ctx *c,
> +				struct tcp_tap_conn *conn, bool v4,
> +				size_t fillsize, ssize_t *dlen)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct msghdr mh_sock = { 0 };
> +	uint16_t mss = MSS_GET(conn);
> +	int s = conn->sock;
> +	size_t l2_hdrlen;
> +	int elem_cnt;
> +	ssize_t ret;
> +
> +	*dlen = 0;
> +
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +
> +	vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
> +
> +	elem_cnt = vu_collect(vdev, vq, elem, VIRTQUEUE_MAX_SIZE, mss,
> +			      l2_hdrlen, fillsize);
> +	if (elem_cnt < 0) {
> +		tcp_rst(c, conn);
> +		return -ENOMEM;

On top of what David mentioned (I don't think this warrants a tcp_rst()
either), ENOMEM means "out of memory", ENOBUFS (no buffer space
available) is probably more appropriate here.

We're not failing to allocate anything, it's just that there are no
buffers left.

> +	}
> +
> +	mh_sock.msg_iov = iov_vu;
> +	mh_sock.msg_iovlen = elem_cnt + 1;
> +
> +	do
> +		ret = recvmsg(s, &mh_sock, MSG_PEEK);
> +	while (ret < 0 && errno == EINTR);
> +
> +	if (ret < 0) {
> +		vu_queue_rewind(vq, elem_cnt);
> +		if (errno != EAGAIN && errno != EWOULDBLOCK) {
> +			ret = -errno;
> +			tcp_rst(c, conn);
> +		}
> +		return ret;
> +	}
> +	if (!ret) {
> +		vu_queue_rewind(vq, elem_cnt);
> +
> +		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> +			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
> +			if (retf) {
> +				tcp_rst(c, conn);
> +				return retf;
> +			}
> +
> +			conn_event(c, conn, TAP_FIN_SENT);
> +		}
> +		return 0;
> +	}
> +
> +	*dlen = ret;
> +
> +	return elem_cnt;
> +}
> +
> +/**
> + * tcp_vu_prepare() - Prepare the packet header
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @first:	Pointer to the array of IO vectors
> + * @dlen:	Packet data length
> + * @check:	Checksum, if already known
> + */
> +static void tcp_vu_prepare(const struct ctx *c,
> +			   struct tcp_tap_conn *conn, struct iovec *first,
> +			   size_t dlen, const uint16_t **check)
> +{
> +	const struct flowside *toside = TAPFLOW(conn);
> +	char *base = first->iov_base;
> +	struct ethhdr *eh;
> +
> +	/* we guess the first iovec provided by the guest can embed
> +	 * all the headers needed by L2 frame
> +	 */
> +
> +	eh = vu_eth(base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	/* initialize header */
> +	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
> +		struct tcp_payload_t *payload;
> +		struct iphdr *iph;
> +
> +		ASSERT(first[0].iov_len >= sizeof(struct virtio_net_hdr_mrg_rxbuf) +
> +		       sizeof(struct ethhdr) + sizeof(struct iphdr) +
> +		       sizeof(struct tcphdr));
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		iph = vu_ip(base);
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +		payload = vu_payloadv4(base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_payload_t, data) / 4;
> +		payload->th.ack = 1;
> +
> +		tcp_fill_headers4(conn, NULL, iph, payload, dlen,
> +				  *check, conn->seq_to_tap, true);
> +		*check = &iph->check;
> +	} else {
> +		struct tcp_payload_t *payload;
> +		struct ipv6hdr *ip6h;
> +
> +		ASSERT(first[0].iov_len >= sizeof(struct virtio_net_hdr_mrg_rxbuf) +
> +		       sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
> +		       sizeof(struct tcphdr));
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		ip6h = vu_ip(base);
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv6(base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_payload_t, data) / 4;
> +		payload->th.ack = 1;
> +
> +		tcp_fill_headers6(conn, NULL, ip6h, payload, dlen,
> +				  conn->seq_to_tap, true);
> +	}
> +}
> +
> +/**
> + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
> + *			     in window
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + *
> + * Return: Negative on connection reset, 0 otherwise
> + */
> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	uint16_t mss = MSS_GET(conn);
> +	size_t l2_hdrlen, fillsize;
> +	int i, iov_cnt, iov_used;
> +	int v4 = CONN_V4(conn);
> +	uint32_t already_sent = 0;
> +	const uint16_t *check;
> +	struct iovec *first;
> +	int frame_size;
> +	int num_buffers;
> +	ssize_t len;
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		flow_err(conn,
> +			 "Got packet, but RX virtqueue not usable yet");
> +		return 0;
> +	}
> +
> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> +
> +	if (SEQ_LT(already_sent, 0)) {
> +		/* RFC 761, section 2.1. */
> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> +		already_sent = 0;
> +	}
> +
> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, STALLED);
> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +		return 0;
> +	}
> +
> +	/* Set up buffer descriptors we'll fill completely and partially. */
> +
> +	fillsize = wnd_scaled;
> +
> +	if (peek_offset_cap)
> +		already_sent = 0;
> +
> +	iov_vu[0].iov_base = tcp_buf_discard;
> +	iov_vu[0].iov_len = already_sent;

I think I had a similar comment to a previous revision. Now, I haven't
tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
I think this should eventually follow the same logic as the (updated)
tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
(!peek_offset_cap).

It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
starting from 1, for simplicity. But I'm not sure if it's safe to pass a
zero iov_len if (peek_offset_cap).

I'll test that (unless you already did) -- if it works, we can fix this
up later as well.

> +	fillsize -= already_sent;
> +
> +	/* collect the buffers from vhost-user and fill them with the
> +	 * data from the socket
> +	 */
> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
> +	if (iov_cnt <= 0)
> +		return iov_cnt;
> +
> +	len -= already_sent;
> +	if (len <= 0) {
> +		conn_flag(c, conn, STALLED);
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}
> +
> +	conn_flag(c, conn, ~STALLED);
> +
> +	/* Likely, some new data was acked too. */
> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> +
> +	/* initialize headers */
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +	iov_used = 0;
> +	num_buffers = 0;
> +	check = NULL;
> +	frame_size = 0;
> +
> +	/* iov_vu is an array of buffers and the buffer size can be
> +	 * smaller than the frame size we want to use but with
> +	 * num_buffer we can merge several virtio iov buffers in one packet
> +	 * we need only to set the packet headers in the first iov and
> +	 * num_buffer to the number of iov entries
> +	 */
> +	for (i = 0; i < iov_cnt && len; i++) {
> +
> +		if (frame_size == 0)
> +			first = &iov_vu[i + 1];
> +
> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> +			iov_vu[i + 1].iov_len = len;
> +
> +		len -= iov_vu[i + 1].iov_len;
> +		iov_used++;
> +
> +		frame_size += iov_vu[i + 1].iov_len;
> +		num_buffers++;
> +
> +		if (frame_size >= mss || len == 0 ||
> +		    i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +			if (i + 1 == iov_cnt)
> +				check = NULL;
> +
> +			/* restore first iovec base: point to vnet header */
> +			vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen);
> +
> +			tcp_vu_prepare(c, conn, first, frame_size, &check);
> +			if (*c->pcap)  {

Nit: excess whitespace.

> +				tcp_vu_update_check(tapside, first, num_buffers);
> +				pcap_iov(first, num_buffers,
> +					 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +			}
> +
> +			conn->seq_to_tap += frame_size;
> +
> +			frame_size = 0;
> +			num_buffers = 0;
> +		}
> +	}
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	/* send packets */
> +	vu_flush(vdev, vq, elem, iov_used);
> +
> +	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +
> +	return 0;
> +}

Minus those and David's comments, it looks good to me until this point
-- I'm still reviewing the UDP part and the common_vu.c functions.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-15 19:54   ` Stefano Brivio
@ 2024-10-16  0:41     ` David Gibson
  2024-10-17  0:10       ` Stefano Brivio
  0 siblings, 1 reply; 50+ messages in thread
From: David Gibson @ 2024-10-16  0:41 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

[-- Attachment #1: Type: text/plain, Size: 1129 bytes --]

On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:
> [Still partial review]
[snip]
> > +	if (peek_offset_cap)
> > +		already_sent = 0;
> > +
> > +	iov_vu[0].iov_base = tcp_buf_discard;
> > +	iov_vu[0].iov_len = already_sent;
> 
> I think I had a similar comment to a previous revision. Now, I haven't
> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
> I think this should eventually follow the same logic as the (updated)
> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
> (!peek_offset_cap).
> 
> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
> zero iov_len if (peek_offset_cap).

> I'll test that (unless you already did) -- if it works, we can fix this
> up later as well.

I believe I tested it at some point, and I think we're already using
it somewhere.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 8/8] test: Add tests for passt in vhost-user mode
  2024-10-15 19:54   ` Stefano Brivio
@ 2024-10-16  8:06     ` Laurent Vivier
  2024-10-16  9:47       ` Stefano Brivio
  0 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-10-16  8:06 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 15/10/2024 21:54, Stefano Brivio wrote:
> On Thu, 10 Oct 2024 14:29:02 +0200
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
>> From: Stefano Brivio <sbrivio@redhat.com>
>>
>> Run functional and performance tests for vhost-user mode as well. For
>> functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
>> to their non-vhost-user counterparts, as no differences are intended
>> but we want to distinguish them in test logs.
>>
>> [...]
>>
>> diff --git a/test/two_guests_vu b/test/two_guests_vu
>> new file mode 120000
>> index 000000000000..144b7cac5438
>> --- /dev/null
>> +++ b/test/two_guests_vu
>> @@ -0,0 +1 @@
>> +test/two_guests
>> \ No newline at end of file
> 
> Oops, this link is wrong: it works if you execute the tests from the
> top git directory, but not if you run them from test/. It should simply
> point to 'two_guests', not to 'test/two_guests'.
> 
> For some reason, even with that fixed, v8 fails for me on that test:
> 'dhclient -6 eth0' gets stuck in guest #2 (it works in guest #1). I
> still need to debug that. I'm fairly sure that v6 or v7 worked.
> 

Perhaps because of this:

diff --git a/test/lib/setup b/test/lib/setup
index 3409bd29cd81..580825f1f9a7 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -264,7 +264,7 @@ setup_two_guests() {
                         -device virtio-net,netdev=v                            \
                         -object memory-backend-memfd,id=m,share=on,size=${__vmem} \
                         -numa node,memdev=m"
-               __qemu_netdev1="                                               \
+               __qemu_netdev2="                                               \
                         -chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
                         -netdev vhost-user,id=v,chardev=c                      \
                         -device virtio-net,netdev=v                            \

Thanks,
Laurent


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 8/8] test: Add tests for passt in vhost-user mode
  2024-10-16  8:06     ` Laurent Vivier
@ 2024-10-16  9:47       ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-16  9:47 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Wed, 16 Oct 2024 10:06:21 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> On 15/10/2024 21:54, Stefano Brivio wrote:
> > On Thu, 10 Oct 2024 14:29:02 +0200
> > Laurent Vivier <lvivier@redhat.com> wrote:
> >   
> >> From: Stefano Brivio <sbrivio@redhat.com>
> >>
> >> Run functional and performance tests for vhost-user mode as well. For
> >> functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
> >> to their non-vhost-user counterparts, as no differences are intended
> >> but we want to distinguish them in test logs.
> >>
> >> [...]
> >>
> >> diff --git a/test/two_guests_vu b/test/two_guests_vu
> >> new file mode 120000
> >> index 000000000000..144b7cac5438
> >> --- /dev/null
> >> +++ b/test/two_guests_vu
> >> @@ -0,0 +1 @@
> >> +test/two_guests
> >> \ No newline at end of file  
> > 
> > Oops, this link is wrong: it works if you execute the tests from the
> > top git directory, but not if you run them from test/. It should simply
> > point to 'two_guests', not to 'test/two_guests'.
> > 
> > For some reason, even with that fixed, v8 fails for me on that test:
> > 'dhclient -6 eth0' gets stuck in guest #2 (it works in guest #1). I
> > still need to debug that. I'm fairly sure that v6 or v7 worked.
> 
> Perhaps because of this:
> 
> diff --git a/test/lib/setup b/test/lib/setup
> index 3409bd29cd81..580825f1f9a7 100755
> --- a/test/lib/setup
> +++ b/test/lib/setup
> @@ -264,7 +264,7 @@ setup_two_guests() {
>                          -device virtio-net,netdev=v                            \
>                          -object memory-backend-memfd,id=m,share=on,size=${__vmem} \
>                          -numa node,memdev=m"
> -               __qemu_netdev1="                                               \
> +               __qemu_netdev2="                                               \
>                          -chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
>                          -netdev vhost-user,id=v,chardev=c                      \
>                          -device virtio-net,netdev=v                            \

Oops, right, nice catch. All tests pass for me now.

It was broken already in the first version I sent, I wonder how it ever
worked for me... I guess I had some local changes on top.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-15  3:23   ` David Gibson
@ 2024-10-16 10:07     ` Laurent Vivier
  2024-10-16 16:26       ` Stefano Brivio
  0 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-10-16 10:07 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On 15/10/2024 05:23, David Gibson wrote:
>> +/**
>> + * tcp_vu_update_check() - Calculate TCP checksum
>> + * @tapside:	Address information for one side of the flow
>> + * @iov:	Pointer to the array of IO vectors
>> + * @iov_used:	Length of the array
>> + */
>> +static void tcp_vu_update_check(const struct flowside *tapside,
>> +			        struct iovec *iov, int iov_used)
> AFAICT this is only used for the pcap path.  Rather than filling in
> the checksum at a different point from normal, I think it would be
> easier to just clear the no_tcp_csum flag when pcap is enabled.  That
> would, AFAICT, remove the need for this function entirely.

To do that is a little bit complicated because we need to pass the iov array to 
tcp_fill_headers4()/tcp_fill_headers6() to be able to compute the checksum of the TCP part.

In tcp_buf, the TCP header and TCP payload are in the same iovec but with tcp_vu they can 
be split on several iovecs. And if we provide the iovec , theoretically we should not 
provide the TAP, IP and TCP headers via the parameters as they are in the iovec, but for 
tcp_buf they have one iovec each, and with tcp_vu they are probably all in the same iovec 
(the first one). So, again, it can be complicated to extract headers to update them.

In conclusion, I update the checksum only for VU and in the case of pcap because it is 
simpler (the same logic applies for udp_update_hdr4()/udp_update_hdr6()).

I'm open to any suggestion that could do the checksum in 
udp_update_hdr4()/udp_update_hdr6() (I agree with you, it should be the place to do it) 
but I don't see an easy and nice way to implement it.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-16 10:07     ` Laurent Vivier
@ 2024-10-16 16:26       ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-16 16:26 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev

On Wed, 16 Oct 2024 12:07:21 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> On 15/10/2024 05:23, David Gibson wrote:
> >> +/**
> >> + * tcp_vu_update_check() - Calculate TCP checksum
> >> + * @tapside:	Address information for one side of the flow
> >> + * @iov:	Pointer to the array of IO vectors
> >> + * @iov_used:	Length of the array
> >> + */
> >> +static void tcp_vu_update_check(const struct flowside *tapside,
> >> +			        struct iovec *iov, int iov_used)  
> > AFAICT this is only used for the pcap path.  Rather than filling in
> > the checksum at a different point from normal, I think it would be
> > easier to just clear the no_tcp_csum flag when pcap is enabled.  That
> > would, AFAICT, remove the need for this function entirely.  
> 
> To do that is a little bit complicated because we need to pass the iov array to 
> tcp_fill_headers4()/tcp_fill_headers6() to be able to compute the checksum of the TCP part.
> 
> In tcp_buf, the TCP header and TCP payload are in the same iovec but with tcp_vu they can 
> be split on several iovecs. And if we provide the iovec , theoretically we should not 
> provide the TAP, IP and TCP headers via the parameters as they are in the iovec, but for 
> tcp_buf they have one iovec each, and with tcp_vu they are probably all in the same iovec 
> (the first one). So, again, it can be complicated to extract headers to update them.
> 
> In conclusion, I update the checksum only for VU and in the case of pcap because it is 
> simpler (the same logic applies for udp_update_hdr4()/udp_update_hdr6()).
> 
> I'm open to any suggestion that could do the checksum in 
> udp_update_hdr4()/udp_update_hdr6() (I agree with you, it should be the place to do it) 
> but I don't see an easy and nice way to implement it.

I don't really have a suggestion here, but this is probably the first
use case we met where switching to the pcapng format would make sense,
as we could just mark the checksum as "offloaded" in this case:

  https://ietf-opsawg-wg.github.io/draft-ietf-opsawg-pcap/draft-ietf-opsawg-pcapng.html#section-4.3.1 

On the other hand, it comes with a lot of complexity in my opinion so I
still think that it wouldn't make sense in general.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-16  0:41     ` David Gibson
@ 2024-10-17  0:10       ` Stefano Brivio
  2024-10-17 11:25         ` Stefano Brivio
  2024-10-22 12:59         ` Laurent Vivier
  0 siblings, 2 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-17  0:10 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Wed, 16 Oct 2024 11:41:34 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:
> > [Still partial review]  
> [snip]
> > > +	if (peek_offset_cap)
> > > +		already_sent = 0;
> > > +
> > > +	iov_vu[0].iov_base = tcp_buf_discard;
> > > +	iov_vu[0].iov_len = already_sent;  
> > 
> > I think I had a similar comment to a previous revision. Now, I haven't
> > tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
> > I think this should eventually follow the same logic as the (updated)
> > tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
> > (!peek_offset_cap).
> > 
> > It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
> > starting from 1, for simplicity. But I'm not sure if it's safe to pass a
> > zero iov_len if (peek_offset_cap).  
> 
> > I'll test that (unless you already did) -- if it works, we can fix this
> > up later as well.  
> 
> I believe I tested it at some point, and I think we're already using
> it somewhere.

I tested it again just to be sure on a recent net.git kernel: sometimes
the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
(using loopback address): big transfer" test instead.

I can reproduce at least one of the two issues consistently (tests
stopped 5 times out of 5).

The socat client completes the transfer, the server is still waiting
for something. I haven't taken captures yet or tried to re-send from
the client.

It all works (consistently) with an older kernel without support for
SO_PEEK_OFF on TCP, but also on this kernel if I force peek_offset_cap
to false in tcp_init().

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-10 12:29 ` [PATCH v8 7/8] vhost-user: add vhost-user Laurent Vivier
  2024-10-15  3:23   ` David Gibson
  2024-10-15 19:54   ` Stefano Brivio
@ 2024-10-17  0:10   ` Stefano Brivio
  2024-10-17  7:28     ` Laurent Vivier
                       ` (3 more replies)
  2 siblings, 4 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-17  0:10 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 10 Oct 2024 14:29:01 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> add virtio and vhost-user functions to connect with QEMU.
> 
>   $ ./passt --vhost-user
> 
> and
> 
>   # qemu-system-x86_64 ... -m 4G \
>         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>         -numa node,memdev=memfd0 \
>         -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>         -netdev vhost-user,id=netdev0,chardev=chr0 \
>         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>         ...
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile     |   6 +-
>  conf.c       |  21 ++-
>  epoll_type.h |   4 +
>  iov.c        |   1 -
>  isolation.c  |  15 +-
>  packet.c     |  11 ++
>  packet.h     |   8 +-
>  passt.1      |  10 +-
>  passt.c      |   9 +
>  passt.h      |   6 +
>  pcap.c       |   1 -
>  tap.c        |  80 +++++++--
>  tap.h        |   5 +-
>  tcp.c        |   7 +
>  tcp_vu.c     | 476 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_vu.h     |  12 ++
>  udp.c        |  10 ++
>  udp_vu.c     | 336 ++++++++++++++++++++++++++++++++++++
>  udp_vu.h     |  13 ++
>  vhost_user.c |  37 ++--
>  vhost_user.h |   4 +-
>  virtio.c     |   5 -
>  vu_common.c  | 385 +++++++++++++++++++++++++++++++++++++++++
>  vu_common.h  |  47 +++++
>  24 files changed, 1454 insertions(+), 55 deletions(-)
>  create mode 100644 tcp_vu.c
>  create mode 100644 tcp_vu.h
>  create mode 100644 udp_vu.c
>  create mode 100644 udp_vu.h
>  create mode 100644 vu_common.c
>  create mode 100644 vu_common.h
> 
> diff --git a/Makefile b/Makefile
> index 0e8ed60a0da1..1e8910dda1f4 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -54,7 +54,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
> +	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> +	vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -64,7 +65,8 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h vhost_user.h virtio.h
> +	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> +	virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/conf.c b/conf.c
> index c63101970155..29d6e41f5770 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -45,6 +45,7 @@
>  #include "lineread.h"
>  #include "isolation.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /**
>   * next_chunk - Return the next piece of a string delimited by a character
> @@ -762,9 +763,14 @@ static void usage(const char *name, FILE *f, int status)
>  			"    default: same interface name as external one\n");
>  	} else {
>  		fprintf(f,
> -			"  -s, --socket PATH	UNIX domain socket path\n"
> +			"  -s, --socket, --socket-path PATH	UNIX domain socket path\n"
>  			"    default: probe free path starting from "
>  			UNIX_SOCK_PATH "\n", 1);
> +		fprintf(f,
> +			"  --vhost-user		Enable vhost-user mode\n"
> +			"    UNIX domain socket is provided by -s option\n"
> +			"  --print-capabilities	print back-end capabilities in JSON format,\n"
> +			"    only meaningful for vhost-user mode\n");
>  	}
>  
>  	fprintf(f,
> @@ -1290,6 +1296,10 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"map-host-loopback", required_argument, NULL,		21 },
>  		{"map-guest-addr", required_argument,	NULL,		22 },
>  		{"dns-host",	required_argument,	NULL,		24 },
> +		{"vhost-user",	no_argument,		NULL,		25 },
> +		/* vhost-user backend program convention */
> +		{"print-capabilities", no_argument,	NULL,		26 },
> +		{"socket-path",	required_argument,	NULL,		's' },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1478,6 +1488,15 @@ void conf(struct ctx *c, int argc, char **argv)
>  				break;
>  
>  			die("Invalid host nameserver address: %s", optarg);
> +		case 25:
> +			if (c->mode == MODE_PASTA) {
> +				err("--vhost-user is for passt mode only");
> +				usage(argv[0], stdout, EXIT_SUCCESS);
> +			}
> +			c->mode = MODE_VU;
> +			break;
> +		case 26:
> +			vu_print_capabilities();
>  			break;
>  		case 'd':
>  			c->debug = 1;
> diff --git a/epoll_type.h b/epoll_type.h
> index 0ad1efa0ccec..f3ef41584757 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -36,6 +36,10 @@ enum epoll_type {
>  	EPOLL_TYPE_TAP_PASST,
>  	/* socket listening for qemu socket connections */
>  	EPOLL_TYPE_TAP_LISTEN,
> +	/* vhost-user command socket */
> +	EPOLL_TYPE_VHOST_CMD,
> +	/* vhost-user kick event socket */
> +	EPOLL_TYPE_VHOST_KICK,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/iov.c b/iov.c
> index 3f9e229a305f..3741db21790f 100644
> --- a/iov.c
> +++ b/iov.c
> @@ -68,7 +68,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
>   *
>   * Returns:    The number of bytes successfully copied.
>   */
> -/* cppcheck-suppress unusedFunction */
>  size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
>  		    size_t offset, const void *buf, size_t bytes)
>  {
> diff --git a/isolation.c b/isolation.c
> index 45fba1e68b9d..c2a3c7b7911d 100644
> --- a/isolation.c
> +++ b/isolation.c
> @@ -379,12 +379,19 @@ void isolate_postfork(const struct ctx *c)
>  
>  	prctl(PR_SET_DUMPABLE, 0);
>  
> -	if (c->mode == MODE_PASTA) {
> -		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> -		prog.filter = filter_pasta;
> -	} else {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
>  		prog.filter = filter_passt;
> +		break;
> +	case MODE_PASTA:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> +		prog.filter = filter_pasta;
> +		break;
> +	case MODE_VU:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
> +		prog.filter = filter_vu;
> +		break;
>  	}
>  
>  	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
> diff --git a/packet.c b/packet.c
> index 37489961a37e..e5a78d079231 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -36,6 +36,17 @@
>  static int packet_check_range(const struct pool *p, size_t offset, size_t len,
>  			      const char *start, const char *func, int line)
>  {
> +	if (p->buf_size == 0) {
> +		int ret;
> +
> +		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
> +
> +		if (ret == -1)
> +			trace("cannot find region, %s:%i", func, line);
> +
> +		return ret;
> +	}
> +
>  	if (start < p->buf) {
>  		trace("packet start %p before buffer start %p, "
>  		      "%s:%i", (void *)start, (void *)p->buf, func, line);
> diff --git a/packet.h b/packet.h
> index 8377dcf678bb..3f70e949c066 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -8,8 +8,10 @@
>  
>  /**
>   * struct pool - Generic pool of packets stored in a buffer
> - * @buf:	Buffer storing packet descriptors
> - * @buf_size:	Total size of buffer
> + * @buf:	Buffer storing packet descriptors,
> + * 		a struct vu_dev_region array for passt vhost-user mode
> + * @buf_size:	Total size of buffer,
> + * 		0 for passt vhost-user mode
>   * @size:	Number of usable descriptors for the pool
>   * @count:	Number of used descriptors for the pool
>   * @pkt:	Descriptors: see macros below
> @@ -22,6 +24,8 @@ struct pool {
>  	struct iovec pkt[1];
>  };
>  
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start);
>  void packet_add_do(struct pool *p, size_t len, const char *start,
>  		   const char *func, int line);
>  void *packet_get_do(const struct pool *p, const size_t idx,
> diff --git a/passt.1 b/passt.1
> index ef33267e9cd7..96532dd39aa2 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -397,12 +397,20 @@ interface address are configured on a given host interface.
>  .SS \fBpasst\fR-only options
>  
>  .TP
> -.BR \-s ", " \-\-socket " " \fIpath
> +.BR \-s ", " \-\-socket-path ", " \-\-socket " " \fIpath
>  Path for UNIX domain socket used by \fBqemu\fR(1) or \fBqrap\fR(1) to connect to
>  \fBpasst\fR.
>  Default is to probe a free socket, not accepting connections, starting from
>  \fI/tmp/passt_1.socket\fR to \fI/tmp/passt_64.socket\fR.
>  
> +.TP
> +.BR \-\-vhost-user
> +Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
> +
> +.TP
> +.BR \-\-print-capabilities
> +Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
> +
>  .TP
>  .BR \-F ", " \-\-fd " " \fIFD
>  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> diff --git a/passt.c b/passt.c
> index 79093ee02d62..2d105e81218d 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -52,6 +52,7 @@
>  #include "arch.h"
>  #include "log.h"
>  #include "tcp_splice.h"
> +#include "vu_common.h"
>  
>  #define EPOLL_EVENTS		8
>  
> @@ -74,6 +75,8 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_TAP_PASTA]		= "/dev/net/tun device",
>  	[EPOLL_TYPE_TAP_PASST]		= "connected qemu socket",
>  	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
> +	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
> +	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -360,6 +363,12 @@ loop:
>  		case EPOLL_TYPE_PING:
>  			icmp_sock_handler(&c, ref);
>  			break;
> +		case EPOLL_TYPE_VHOST_CMD:
> +			vu_control_handler(c.vdev, c.fd_tap, eventmask);
> +			break;
> +		case EPOLL_TYPE_VHOST_KICK:
> +			vu_kick_cb(c.vdev, ref, &now);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 4908ed937dc8..311482d36257 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -25,6 +25,8 @@ union epoll_ref;
>  #include "fwd.h"
>  #include "tcp.h"
>  #include "udp.h"
> +#include "udp_vu.h"
> +#include "vhost_user.h"
>  
>  /* Default address for our end on the tap interface.  Bit 0 of byte 0 must be 0
>   * (unicast) and bit 1 of byte 1 must be 1 (locally administered).  Otherwise
> @@ -94,6 +96,7 @@ struct fqdn {
>  enum passt_modes {
>  	MODE_PASST,
>  	MODE_PASTA,
> +	MODE_VU,
>  };
>  
>  /**
> @@ -228,6 +231,7 @@ struct ip6_ctx {
>   * @freebind:		Allow binding of non-local addresses for forwarding
>   * @low_wmem:		Low probed net.core.wmem_max
>   * @low_rmem:		Low probed net.core.rmem_max
> + * @vdev:		vhost-user device
>   */
>  struct ctx {
>  	enum passt_modes mode;
> @@ -289,6 +293,8 @@ struct ctx {
>  
>  	int low_wmem;
>  	int low_rmem;
> +
> +	struct vu_dev *vdev;
>  };
>  
>  void proto_update_l2_buf(const unsigned char *eth_d,
> diff --git a/pcap.c b/pcap.c
> index 6ee6cdfd261a..718d6ad61732 100644
> --- a/pcap.c
> +++ b/pcap.c
> @@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
>   * @iovcnt:	Number of buffers (@iov entries)
>   * @offset:	Offset of the L2 frame within the full data length
>   */
> -/* cppcheck-suppress unusedFunction */
>  void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
>  {
>  	struct timespec now;
> diff --git a/tap.c b/tap.c
> index 4b826fdf7adc..22d19f1833f7 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -58,6 +58,8 @@
>  #include "packet.h"
>  #include "tap.h"
>  #include "log.h"
> +#include "vhost_user.h"
> +#include "vu_common.h"
>  
>  /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
>  static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
> @@ -78,16 +80,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
>  	struct iovec iov[2];
>  	size_t iovcnt = 0;
>  
> -	if (c->mode == MODE_PASST) {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
>  		iovcnt++;
> -	}
> -
> -	iov[iovcnt].iov_base = (void *)data;
> -	iov[iovcnt].iov_len = l2len;
> -	iovcnt++;
> +		/* fall through */
> +	case MODE_PASTA:
> +		iov[iovcnt].iov_base = (void *)data;
> +		iov[iovcnt].iov_len = l2len;
> +		iovcnt++;
>  
> -	tap_send_frames(c, iov, iovcnt, 1);
> +		tap_send_frames(c, iov, iovcnt, 1);
> +		break;
> +	case MODE_VU:
> +		vu_send_single(c, data, l2len);
> +		break;
> +	}
>  }
>  
>  /**
> @@ -414,10 +422,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  	if (!nframes)
>  		return 0;
>  
> -	if (c->mode == MODE_PASTA)
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
> -	else
> +		break;
> +	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
> +		break;
> +	case MODE_VU:
> +		/* fall through */
> +	default:
> +		ASSERT(0);
> +	}
>  
>  	if (m < nframes)
>  		debug("tap: failed to send %zu frames of %zu",
> @@ -976,7 +992,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
>   * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>   * @c:		Execution context
>   */
> -static void tap_sock_reset(struct ctx *c)
> +void tap_sock_reset(struct ctx *c)
>  {
>  	info("Client connection closed%s", c->one_off ? ", exiting" : "");
>  
> @@ -987,6 +1003,8 @@ static void tap_sock_reset(struct ctx *c)
>  	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
>  	close(c->fd_tap);
>  	c->fd_tap = -1;
> +	if (c->mode == MODE_VU)
> +		vu_cleanup(c->vdev);
>  }
>  
>  /**
> @@ -1205,6 +1223,11 @@ static void tap_backend_show_hints(struct ctx *c)
>  		info("or qrap, for earlier qemu versions:");
>  		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
>  		break;
> +	case MODE_VU:
> +		info("You can start qemu with:");
> +		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
> +		     c->sock_path);
> +		break;
>  	}
>  }
>  
> @@ -1232,8 +1255,8 @@ static void tap_sock_unix_init(const struct ctx *c)
>   */
>  void tap_listen_handler(struct ctx *c, uint32_t events)
>  {
> -	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
>  	struct epoll_event ev = { 0 };
> +	union epoll_ref ref = { 0 };
>  	int v = INT_MAX / 2;
>  	struct ucred ucred;
>  	socklen_t len;
> @@ -1273,6 +1296,10 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  		trace("tap: failed to set SO_SNDBUF to %i", v);
>  
>  	ref.fd = c->fd_tap;
> +	if (c->mode == MODE_VU)
> +		ref.type = EPOLL_TYPE_VHOST_CMD;
> +	else
> +		ref.type = EPOLL_TYPE_TAP_PASST;
>  	ev.events = EPOLLIN | EPOLLRDHUP;
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> @@ -1339,7 +1366,7 @@ static void tap_sock_tun_init(struct ctx *c)
>   * @base:	Buffer base
>   * @size	Buffer size
>   */
> -static void tap_sock_update_pool(void *base, size_t size)
> +void tap_sock_update_pool(void *base, size_t size)
>  {
>  	int i;
>  
> @@ -1353,13 +1380,15 @@ static void tap_sock_update_pool(void *base, size_t size)
>  }
>  
>  /**
> - * tap_backend_init() - Create and set up AF_UNIX socket or
> - *			tuntap file descriptor
> + * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
>   * @c:		Execution context
>   */
>  void tap_backend_init(struct ctx *c)
>  {
> -	tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
> +	if (c->mode == MODE_VU)
> +		tap_sock_update_pool(NULL, 0);
> +	else
> +		tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
>  
>  	if (c->fd_tap != -1) { /* Passed as --fd */
>  		struct epoll_event ev = { 0 };
> @@ -1367,10 +1396,17 @@ void tap_backend_init(struct ctx *c)
>  
>  		ASSERT(c->one_off);
>  		ref.fd = c->fd_tap;
> -		if (c->mode == MODE_PASST)
> +		switch (c->mode) {
> +		case MODE_PASST:
>  			ref.type = EPOLL_TYPE_TAP_PASST;
> -		else
> +			break;
> +		case MODE_PASTA:
>  			ref.type = EPOLL_TYPE_TAP_PASTA;
> +			break;
> +		case MODE_VU:
> +			ref.type = EPOLL_TYPE_VHOST_CMD;
> +			break;
> +		}
>  
>  		ev.events = EPOLLIN | EPOLLRDHUP;
>  		ev.data.u64 = ref.u64;
> @@ -1378,9 +1414,14 @@ void tap_backend_init(struct ctx *c)
>  		return;
>  	}
>  
> -	if (c->mode == MODE_PASTA) {
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		tap_sock_tun_init(c);
> -	} else {
> +		break;
> +	case MODE_VU:
> +		vu_init(c);
> +		/* fall through */
> +	case MODE_PASST:
>  		tap_sock_unix_init(c);
>  
>  		/* In passt mode, we don't know the guest's MAC address until it
> @@ -1388,6 +1429,7 @@ void tap_backend_init(struct ctx *c)
>  		 * first packets will reach it.
>  		 */
>  		memset(&c->guest_mac, 0xff, sizeof(c->guest_mac));
> +		break;
>  	}
>  
>  	tap_backend_show_hints(c);
> diff --git a/tap.h b/tap.h
> index 8728cc5c09c3..dfbd8b9ebd72 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
>   */
>  static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
>  {
> -	thdr->vnet_len = htonl(l2len);
> +	if (thdr)
> +		thdr->vnet_len = htonl(l2len);
>  }
>  
>  void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
> @@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  int tap_sock_unix_open(char *sock_path);
> +void tap_sock_reset(struct ctx *c);
> +void tap_sock_update_pool(void *base, size_t size);
>  void tap_backend_init(struct ctx *c);
>  void tap_flush_pools(void);
>  void tap_handler(struct ctx *c, const struct timespec *now);
> diff --git a/tcp.c b/tcp.c
> index eae02b1647e3..fd2def0d8a39 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -304,6 +304,7 @@
>  #include "flow_table.h"
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
> +#include "tcp_vu.h"
>  
>  /* MSS rounding: see SET_MSS() */
>  #define MSS_DEFAULT			536
> @@ -1328,6 +1329,9 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
>  static int tcp_send_flag(const struct ctx *c, struct tcp_tap_conn *conn,
>  			 int flags)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_send_flag(c, conn, flags);
> +
>  	return tcp_buf_send_flag(c, conn, flags);
>  }
>  
> @@ -1721,6 +1725,9 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>   */
>  static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_data_from_sock(c, conn);
> +
>  	return tcp_buf_data_from_sock(c, conn);
>  }
>  
> diff --git a/tcp_vu.c b/tcp_vu.c
> new file mode 100644
> index 000000000000..1126fb39d138
> --- /dev/null
> +++ b/tcp_vu.c
> @@ -0,0 +1,476 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* tcp_vu.c - TCP L2 vhost-user management functions
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#include <errno.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +#include <netinet/ip.h>
> +
> +#include <sys/socket.h>
> +
> +#include <linux/tcp.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "vhost_user.h"
> +#include "tcp.h"
> +#include "pcap.h"
> +#include "flow.h"
> +#include "tcp_conn.h"
> +#include "flow_table.h"
> +#include "tcp_vu.h"
> +#include "tap.h"
> +#include "tcp_internal.h"
> +#include "checksum.h"
> +#include "vu_common.h"
> +
> +static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + 1];
> +static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +
> +/**
> + * tcp_vu_l2_hdrlen() - return the size of the header in level 2 frame (TDP)
> + * @v6:		Set for IPv6 packet
> + *
> + * Return: Return the size of the header
> + */
> +static size_t tcp_vu_l2_hdrlen(bool v6)
> +{
> +	size_t l2_hdrlen;
> +
> +	l2_hdrlen = sizeof(struct ethhdr) + sizeof(struct tcphdr);
> +
> +	if (v6)
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	else
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +	return l2_hdrlen;
> +}
> +
> +/**
> + * tcp_vu_update_check() - Calculate TCP checksum
> + * @tapside:	Address information for one side of the flow
> + * @iov:	Pointer to the array of IO vectors
> + * @iov_used:	Length of the array
> + */
> +static void tcp_vu_update_check(const struct flowside *tapside,
> +			        struct iovec *iov, int iov_used)
> +{
> +	char *base = iov[0].iov_base;
> +
> +	if (inany_v4(&tapside->oaddr)) {
> +		const struct iphdr *iph = vu_ip(base);
> +
> +		tcp_update_check_tcp4(iph, iov, iov_used,
> +				      (char *)vu_payloadv4(base) - base);
> +	} else {
> +		const struct ipv6hdr *ip6h = vu_ip(base);
> +
> +		tcp_update_check_tcp6(ip6h, iov, iov_used,
> +				      (char *)vu_payloadv6(base) - base);
> +	}
> +}
> +
> +/**
> + * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @flags:	TCP flags: if not set, send segment only if ACK is due
> + *
> + * Return: negative error code on connection reset, 0 otherwise
> + */
> +int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	size_t l2len, l4len, optlen, hdrlen;
> +	struct ethhdr *eh;
> +	int elem_cnt;
> +	int nb_ack;
> +	int ret;
> +
> +	hdrlen = tcp_vu_l2_hdrlen(CONN_V6(conn));
> +
> +	vu_init_elem(elem, iov_vu, 2);
> +
> +	elem_cnt = vu_collect_one_frame(vdev, vq, elem, 1,
> +					hdrlen + OPT_MSS_LEN + OPT_WS_LEN + 1,
> +					0, NULL);
> +	if (elem_cnt < 1)
> +		return 0;
> +
> +	vu_set_vnethdr(vdev, &iov_vu[0], 1, 0);
> +
> +	eh = vu_eth(iov_vu[0].iov_base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	if (CONN_V4(conn)) {
> +		struct tcp_payload_t *payload;
> +		struct iphdr *iph;
> +		uint32_t seq;
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		iph = vu_ip(iov_vu[0].iov_base);
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv4(iov_vu[0].iov_base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_flags_t, opts) / 4;
> +		payload->th.ack = 1;
> +
> +		seq = conn->seq_to_tap;
> +		ret = tcp_prepare_flags(c, conn, flags, &payload->th,
> +					(char *)payload->data, &optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vq, 1);
> +			return ret;
> +		}
> +
> +		l4len = tcp_fill_headers4(conn, NULL, iph, payload, optlen,
> +					  NULL, seq, true);
> +		l2len = sizeof(*iph);
> +	} else {
> +		struct tcp_payload_t *payload;
> +		struct ipv6hdr *ip6h;
> +		uint32_t seq;
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		ip6h = vu_ip(iov_vu[0].iov_base);
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv6(iov_vu[0].iov_base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_flags_t, opts) / 4;
> +		payload->th.ack = 1;
> +
> +		seq = conn->seq_to_tap;
> +		ret = tcp_prepare_flags(c, conn, flags, &payload->th,
> +					(char *)payload->data, &optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vq, 1);
> +			return ret;
> +		}
> +
> +		l4len = tcp_fill_headers6(conn, NULL, ip6h, payload, optlen,
> +					  seq, true);
> +		l2len = sizeof(*ip6h);
> +	}
> +	l2len += l4len + sizeof(struct ethhdr);
> +
> +	elem[0].in_sg[0].iov_len = l2len +
> +				   sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	if (*c->pcap) {
> +		tcp_vu_update_check(tapside, &elem[0].in_sg[0], 1);
> +		pcap_iov(&elem[0].in_sg[0], 1,
> +			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +	}
> +	nb_ack = 1;
> +
> +	if (flags & DUP_ACK) {
> +		elem_cnt = vu_collect_one_frame(vdev, vq, &elem[1], 1, l2len,
> +						0, NULL);
> +		if (elem_cnt == 1) {
> +			memcpy(elem[1].in_sg[0].iov_base,
> +			       elem[0].in_sg[0].iov_base, l2len);
> +			vu_set_vnethdr(vdev, &elem[1].in_sg[0], 1, 0);
> +			nb_ack++;
> +
> +			if (*c->pcap)
> +				pcap_iov(&elem[1].in_sg[0], 1, 0);
> +		}
> +	}
> +
> +	vu_flush(vdev, vq, elem, nb_ack);
> +
> +	return 0;
> +}
> +
> +/** tcp_vu_sock_recv() - Receive datastream from socket into vhost-user buffers
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @v4:		Set for IPv4 connections
> + * @fillsize:	Number of bytes we can receive

So, it's the third time I review a version of this function, and the
third time I ask myself in which sense we _can_ receive those bytes. :)

Now that I remembered: what about "Maximum bytes to fill in guest-side
receiving window"?

> + * @datalen:	Size of received data (output)
> + *
> + * Return: Number of iov entries used to store the data
> + */
> +static ssize_t tcp_vu_sock_recv(const struct ctx *c,
> +				struct tcp_tap_conn *conn, bool v4,
> +				size_t fillsize, ssize_t *dlen)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct msghdr mh_sock = { 0 };
> +	uint16_t mss = MSS_GET(conn);
> +	int s = conn->sock;
> +	size_t l2_hdrlen;
> +	int elem_cnt;
> +	ssize_t ret;
> +
> +	*dlen = 0;
> +
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +
> +	vu_init_elem(elem, &iov_vu[1], VIRTQUEUE_MAX_SIZE);
> +
> +	elem_cnt = vu_collect(vdev, vq, elem, VIRTQUEUE_MAX_SIZE, mss,
> +			      l2_hdrlen, fillsize);
> +	if (elem_cnt < 0) {
> +		tcp_rst(c, conn);
> +		return -ENOMEM;
> +	}
> +
> +	mh_sock.msg_iov = iov_vu;
> +	mh_sock.msg_iovlen = elem_cnt + 1;
> +
> +	do
> +		ret = recvmsg(s, &mh_sock, MSG_PEEK);
> +	while (ret < 0 && errno == EINTR);
> +
> +	if (ret < 0) {
> +		vu_queue_rewind(vq, elem_cnt);
> +		if (errno != EAGAIN && errno != EWOULDBLOCK) {
> +			ret = -errno;
> +			tcp_rst(c, conn);
> +		}
> +		return ret;
> +	}
> +	if (!ret) {
> +		vu_queue_rewind(vq, elem_cnt);
> +
> +		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> +			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
> +			if (retf) {
> +				tcp_rst(c, conn);
> +				return retf;
> +			}
> +
> +			conn_event(c, conn, TAP_FIN_SENT);
> +		}
> +		return 0;
> +	}
> +
> +	*dlen = ret;
> +
> +	return elem_cnt;
> +}
> +
> +/**
> + * tcp_vu_prepare() - Prepare the packet header
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + * @first:	Pointer to the array of IO vectors
> + * @dlen:	Packet data length
> + * @check:	Checksum, if already known
> + */
> +static void tcp_vu_prepare(const struct ctx *c,
> +			   struct tcp_tap_conn *conn, struct iovec *first,
> +			   size_t dlen, const uint16_t **check)
> +{
> +	const struct flowside *toside = TAPFLOW(conn);
> +	char *base = first->iov_base;
> +	struct ethhdr *eh;
> +
> +	/* we guess the first iovec provided by the guest can embed
> +	 * all the headers needed by L2 frame
> +	 */

What happens if it doesn't (buggy guest)? Do we have a way to make sure
it's the case? I guess it's more straightforward to do this in
tcp_vu_data_from_sock() where we check and set iov_len (even though the
implication of VIRTIO_NET_F_MRG_RXBUF isn't totally clear to me).

> +
> +	eh = vu_eth(base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	/* initialize header */
> +	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
> +		struct tcp_payload_t *payload;
> +		struct iphdr *iph;
> +
> +		ASSERT(first[0].iov_len >= sizeof(struct virtio_net_hdr_mrg_rxbuf) +
> +		       sizeof(struct ethhdr) + sizeof(struct iphdr) +
> +		       sizeof(struct tcphdr));
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		iph = vu_ip(base);
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +		payload = vu_payloadv4(base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_payload_t, data) / 4;
> +		payload->th.ack = 1;
> +
> +		tcp_fill_headers4(conn, NULL, iph, payload, dlen,
> +				  *check, conn->seq_to_tap, true);
> +		*check = &iph->check;
> +	} else {
> +		struct tcp_payload_t *payload;
> +		struct ipv6hdr *ip6h;
> +
> +		ASSERT(first[0].iov_len >= sizeof(struct virtio_net_hdr_mrg_rxbuf) +
> +		       sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
> +		       sizeof(struct tcphdr));
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		ip6h = vu_ip(base);
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		payload = vu_payloadv6(base);
> +		memset(&payload->th, 0, sizeof(payload->th));
> +		payload->th.doff = offsetof(struct tcp_payload_t, data) / 4;
> +		payload->th.ack = 1;
> +
> +		tcp_fill_headers6(conn, NULL, ip6h, payload, dlen,
> +				  conn->seq_to_tap, true);
> +	}
> +}
> +
> +/**
> + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
> + *			     in window
> + * @c:		Execution context
> + * @conn:	Connection pointer
> + *
> + * Return: Negative on connection reset, 0 otherwise
> + */
> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	uint16_t mss = MSS_GET(conn);
> +	size_t l2_hdrlen, fillsize;
> +	int i, iov_cnt, iov_used;
> +	int v4 = CONN_V4(conn);
> +	uint32_t already_sent = 0;
> +	const uint16_t *check;
> +	struct iovec *first;
> +	int frame_size;
> +	int num_buffers;
> +	ssize_t len;
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		flow_err(conn,
> +			 "Got packet, but RX virtqueue not usable yet");
> +		return 0;
> +	}
> +
> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> +
> +	if (SEQ_LT(already_sent, 0)) {
> +		/* RFC 761, section 2.1. */
> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> +		already_sent = 0;
> +	}
> +
> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, STALLED);
> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +		return 0;
> +	}
> +
> +	/* Set up buffer descriptors we'll fill completely and partially. */
> +
> +	fillsize = wnd_scaled;
> +
> +	if (peek_offset_cap)
> +		already_sent = 0;
> +
> +	iov_vu[0].iov_base = tcp_buf_discard;
> +	iov_vu[0].iov_len = already_sent;
> +	fillsize -= already_sent;
> +
> +	/* collect the buffers from vhost-user and fill them with the
> +	 * data from the socket
> +	 */
> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
> +	if (iov_cnt <= 0)
> +		return iov_cnt;
> +
> +	len -= already_sent;
> +	if (len <= 0) {
> +		conn_flag(c, conn, STALLED);
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}
> +
> +	conn_flag(c, conn, ~STALLED);
> +
> +	/* Likely, some new data was acked too. */
> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> +
> +	/* initialize headers */
> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> +	iov_used = 0;
> +	num_buffers = 0;
> +	check = NULL;
> +	frame_size = 0;
> +
> +	/* iov_vu is an array of buffers and the buffer size can be
> +	 * smaller than the frame size we want to use but with
> +	 * num_buffer we can merge several virtio iov buffers in one packet
> +	 * we need only to set the packet headers in the first iov and
> +	 * num_buffer to the number of iov entries

...this part is clear to me, what I don't understand is if we still
have a way to guarantee that the sum of several buffers is big enough
to fit frame_size bytes.

> +	 */
> +	for (i = 0; i < iov_cnt && len; i++) {
> +

Excess newline.

> +		if (frame_size == 0)
> +			first = &iov_vu[i + 1];
> +
> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> +			iov_vu[i + 1].iov_len = len;
> +
> +		len -= iov_vu[i + 1].iov_len;
> +		iov_used++;
> +
> +		frame_size += iov_vu[i + 1].iov_len;
> +		num_buffers++;
> +
> +		if (frame_size >= mss || len == 0 ||
> +		    i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +			if (i + 1 == iov_cnt)
> +				check = NULL;
> +
> +			/* restore first iovec base: point to vnet header */
> +			vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen);
> +
> +			tcp_vu_prepare(c, conn, first, frame_size, &check);
> +			if (*c->pcap)  {
> +				tcp_vu_update_check(tapside, first, num_buffers);
> +				pcap_iov(first, num_buffers,
> +					 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +			}
> +
> +			conn->seq_to_tap += frame_size;

We always increase this, even if, later...

> +
> +			frame_size = 0;
> +			num_buffers = 0;
> +		}
> +	}
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	/* send packets */
> +	vu_flush(vdev, vq, elem, iov_used);

we fail to send packets, that is, even if vu_queue_fill_by_index()
returns early because (!vq->vring.avail).

We had this same issue on the non-vhost-user path until commit
a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter for
dropped frames") (completely reworked with time). There, it was pretty
bad with small (default) values for wmem_max and rmem_max.

Now, I _guess_ with vhost-user it won't be so easy to hit that, because
virtqueue buffers are (altogether) bigger, so we can probably fix this
later, but if it's not exceedingly complicated, we should consider
fixing it now. If we hit something like that, the behaviour is pretty
bad, with constant retransmissions and stalls.

The mapping between queued frames and connections is done in
tcp_data_to_tap(), where tcp4_frame_conns[] and tcp6_frame_conns[]
items are set to the current (highest) value of tcp4_payload_used and
tcp6_payload_used.

Then, if we fail to transmit some frames, tcp_revert_seq() uses those
arrays to revert the seq_to_tap values.

I guess you could make vu_queue_fill_by_index() return an error,
propagate it, and make vu_flush() call something like tcp_revert_seq()
in case.

> +
> +	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +
> +	return 0;
> +}
> diff --git a/tcp_vu.h b/tcp_vu.h
> new file mode 100644
> index 000000000000..6ab6057f352a
> --- /dev/null
> +++ b/tcp_vu.h
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef TCP_VU_H
> +#define TCP_VU_H
> +
> +int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
> +
> +#endif  /*TCP_VU_H */
> diff --git a/udp.c b/udp.c
> index 8fc5d8099310..1171d9d1a75b 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -628,6 +628,11 @@ void udp_listen_sock_handler(const struct ctx *c,
>  			     union epoll_ref ref, uint32_t events,
>  			     const struct timespec *now)
>  {
> +	if (c->mode == MODE_VU) {
> +		udp_vu_listen_sock_handler(c, ref, events, now);
> +		return;
> +	}
> +
>  	udp_buf_listen_sock_handler(c, ref, events, now);
>  }
>  
> @@ -697,6 +702,11 @@ static void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>  void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>  			    uint32_t events, const struct timespec *now)
>  {
> +	if (c->mode == MODE_VU) {
> +		udp_vu_reply_sock_handler(c, ref, events, now);
> +		return;
> +	}
> +
>  	udp_buf_reply_sock_handler(c, ref, events, now);
>  }
>  
> diff --git a/udp_vu.c b/udp_vu.c
> new file mode 100644
> index 000000000000..3cb76945c9c1
> --- /dev/null
> +++ b/udp_vu.c
> @@ -0,0 +1,336 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* udp_vu.c - UDP L2 vhost-user management functions
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#include <unistd.h>
> +#include <assert.h>
> +#include <net/ethernet.h>
> +#include <net/if.h>
> +#include <netinet/in.h>
> +#include <netinet/ip.h>
> +#include <netinet/udp.h>
> +#include <stdint.h>
> +#include <stddef.h>
> +#include <sys/uio.h>
> +#include <linux/virtio_net.h>
> +
> +#include "checksum.h"
> +#include "util.h"
> +#include "ip.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "passt.h"
> +#include "pcap.h"
> +#include "log.h"
> +#include "vhost_user.h"
> +#include "udp_internal.h"
> +#include "flow.h"
> +#include "flow_table.h"
> +#include "udp_flow.h"
> +#include "udp_vu.h"
> +#include "vu_common.h"
> +
> +static struct iovec     iov_vu		[VIRTQUEUE_MAX_SIZE];
> +static struct vu_virtq_element	elem		[VIRTQUEUE_MAX_SIZE];
> +
> +/**
> + * udp_vu_l2_hdrlen() - return the size of the header in level 2 frame (UDP)
> + * @v6:		Set for IPv6 packet
> + *
> + * Return: Return the size of the header
> + */
> +static size_t udp_vu_l2_hdrlen(bool v6)
> +{
> +	size_t l2_hdrlen;
> +
> +	l2_hdrlen = sizeof(struct ethhdr) + sizeof(struct udphdr);
> +
> +	if (v6)
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	else
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +	return l2_hdrlen;
> +}
> +
> +static int udp_vu_sock_init(int s, union sockaddr_inany *s_in)
> +{
> +	struct msghdr msg = {
> +		.msg_name = s_in,
> +		.msg_namelen = sizeof(union sockaddr_inany),
> +	};
> +
> +	return recvmsg(s, &msg, MSG_PEEK | MSG_DONTWAIT);
> +}
> +
> +/**
> + * udp_vu_sock_recv() - Receive datagrams from socket into vhost-user buffers
> + * @c:		Execution context
> + * @s:		Socket to receive from
> + * @events:	epoll events bitmap
> + * @v6:		Set for IPv6 connections
> + * @datalen:	Size of received data (output)

Now it's dlen.

> + *
> + * Return: Number of iov entries used to store the datagram
> + */
> +static int udp_vu_sock_recv(const struct ctx *c, int s, uint32_t events,
> +			    bool v6, ssize_t *dlen)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int max_elem, iov_cnt, idx, iov_used;
> +	struct msghdr msg  = { 0 };
> +	size_t off, l2_hdrlen;
> +
> +	ASSERT(!c->no_udp);
> +
> +	if (!(events & EPOLLIN))
> +		return 0;
> +
> +	/* compute L2 header length */
> +
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		max_elem = VIRTQUEUE_MAX_SIZE;
> +	else
> +		max_elem = 1;
> +
> +	l2_hdrlen = udp_vu_l2_hdrlen(v6);
> +
> +	vu_init_elem(elem, iov_vu, max_elem);
> +
> +	iov_cnt = vu_collect_one_frame(vdev, vq, elem, max_elem,
> +			      ETH_MAX_MTU - l2_hdrlen,
> +			      l2_hdrlen, NULL);

The indentation is a bit weird here, I would expect the fifth and
following arguments to be aligned under (.

> +	if (iov_cnt == 0)
> +		return 0;
> +
> +	msg.msg_iov = iov_vu;
> +	msg.msg_iovlen = iov_cnt;
> +
> +	*dlen = recvmsg(s, &msg, 0);
> +	if (*dlen < 0) {
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}
> +
> +	/* count the numbers of buffer filled by recvmsg() */
> +	idx = iov_skip_bytes(iov_vu, iov_cnt, *dlen, &off);
> +
> +	/* adjust last iov length */
> +	if (idx < iov_cnt)
> +		iov_vu[idx].iov_len = off;
> +	iov_used = idx + !!off;
> +
> +	/* we have at least the header */
> +	if (iov_used == 0)
> +		iov_used = 1;

Is iov_used == 0 the only case where we need to add 1 to it?

> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	vu_set_vnethdr(vdev, &iov_vu[0], iov_used, l2_hdrlen);
> +
> +	return iov_used;
> +}
> +
> +/**
> + * udp_vu_prepare() - Prepare the packet header
> + * @c:		Execution context
> + * @toside:	Address information for one side of the flow
> + * @datalen:	Packet data length

dlen now.

> + *
> + * Return: Layer-4 length
> + */
> +static size_t udp_vu_prepare(const struct ctx *c,
> +			     const struct flowside *toside, ssize_t dlen)
> +{
> +	struct ethhdr *eh;
> +	size_t l4len;
> +
> +	/* ethernet header */
> +	eh = vu_eth(iov_vu[0].iov_base);
> +
> +	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
> +
> +	/* initialize header */
> +	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
> +		struct iphdr *iph = vu_ip(iov_vu[0].iov_base);
> +		struct udp_payload_t *bp = vu_payloadv4(iov_vu[0].iov_base);
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
> +
> +		l4len = udp_update_hdr4(iph, bp, toside, dlen, true);
> +	} else {
> +		struct ipv6hdr *ip6h = vu_ip(iov_vu[0].iov_base);
> +		struct udp_payload_t *bp = vu_payloadv6(iov_vu[0].iov_base);
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
> +
> +		l4len = udp_update_hdr6(ip6h, bp, toside, dlen, true);
> +	}
> +
> +	return l4len;
> +}
> +
> +/**
> + * udp_vu_csum() - Calculate and set checksum for a UDP packet
> + * @toside:	ddress information for one side of the flow

Address

> + * @l4len:	IPv4 Payload length

Not actually passed.

> + * @iov_used:	Length of the array

"Number of used iov_vu items"? Otherwise it's a bit hard to understand
which array you're referring to.

> + */
> +static void udp_vu_csum(const struct flowside *toside, int iov_used)
> +{
> +	const struct in_addr *src4 = inany_v4(&toside->oaddr);
> +	const struct in_addr *dst4 = inany_v4(&toside->eaddr);
> +	char *base = iov_vu[0].iov_base;
> +	struct udp_payload_t *bp;
> +
> +	if (src4 && dst4) {
> +		bp = vu_payloadv4(base);
> +		csum_udp4(&bp->uh, *src4, *dst4, iov_vu, iov_used,
> +			  (char *)&bp->data - base);
> +	} else {
> +		bp = vu_payloadv6(base);
> +		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6,
> +			  iov_vu, iov_used, (char *)&bp->data - base);
> +	}
> +}
> +
> +/**
> + * udp_vu_listen_sock_handler() - Handle new data from socket
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @events:	epoll events bitmap
> + * @now:	Current timestamp
> + */
> +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int i;
> +
> +	if (udp_sock_errs(c, ref.fd, events) < 0) {
> +		err("UDP: Unrecoverable error on listening socket:"
> +		    " (%s port %hu)", pif_name(ref.udp.pif), ref.udp.port);
> +		return;
> +	}
> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		const struct flowside *toside;
> +		union sockaddr_inany s_in;
> +		flow_sidx_t batchsidx;
> +		uint8_t batchpif;
> +		ssize_t dlen;
> +		int iov_used;
> +		bool v6;
> +
> +		if (udp_vu_sock_init(ref.fd, &s_in) < 0)
> +			break;
> +
> +		batchsidx = udp_flow_from_sock(c, ref, &s_in, now);
> +		batchpif = pif_at_sidx(batchsidx);
> +
> +		if (batchpif != PIF_TAP) {
> +			if (flow_sidx_valid(batchsidx)) {
> +				flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
> +				struct udp_flow *uflow = udp_at_sidx(batchsidx);
> +
> +				flow_err(uflow,
> +					"No support for forwarding UDP from %s to %s",
> +					pif_name(pif_at_sidx(fromsidx)),
> +					pif_name(batchpif));
> +			} else {
> +				debug("Discarding 1 datagram without flow");
> +			}
> +
> +			continue;
> +		}
> +
> +		toside = flowside_at_sidx(batchsidx);
> +
> +		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> +
> +		iov_used = udp_vu_sock_recv(c, ref.fd, events, v6, &dlen);
> +		if (iov_used <= 0)
> +			break;
> +
> +		udp_vu_prepare(c, toside, dlen);
> +		if (*c->pcap) {
> +			udp_vu_csum(toside, iov_used);
> +			pcap_iov(iov_vu, iov_used,
> +				 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +		}
> +		vu_flush(vdev, vq, elem, iov_used);
> +	}
> +}
> +
> +/**
> + * udp_vu_reply_sock_handler() - Handle new data from flow specific socket
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @events:	epoll events bitmap
> + * @now:	Current timestamp
> + */
> +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			        uint32_t events, const struct timespec *now)
> +{
> +	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
> +	const struct flowside *toside = flowside_at_sidx(tosidx);
> +	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
> +	int from_s = uflow->s[ref.flowside.sidei];
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int i;
> +
> +	ASSERT(!c->no_udp);
> +
> +	if (udp_sock_errs(c, from_s, events) < 0) {
> +		flow_err(uflow, "Unrecoverable error on reply socket");
> +		flow_err_details(uflow);
> +		udp_flow_close(c, uflow);
> +		return;
> +	}
> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		uint8_t topif = pif_at_sidx(tosidx);
> +		ssize_t dlen;
> +		int iov_used;
> +		bool v6;
> +
> +		ASSERT(uflow);
> +
> +		if (topif != PIF_TAP) {
> +			uint8_t frompif = pif_at_sidx(ref.flowside);
> +
> +			flow_err(uflow,
> +				 "No support for forwarding UDP from %s to %s",
> +				 pif_name(frompif), pif_name(topif));
> +			continue;
> +		}
> +
> +		v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
> +
> +		iov_used = udp_vu_sock_recv(c, from_s, events, v6, &dlen);
> +		if (iov_used <= 0)
> +			break;
> +		flow_trace(uflow, "Received 1 datagram on reply socket");
> +		uflow->ts = now->tv_sec;
> +
> +		udp_vu_prepare(c, toside, dlen);
> +		if (*c->pcap) {
> +			udp_vu_csum(toside, iov_used);
> +			pcap_iov(iov_vu, iov_used,
> +				 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +		}
> +		vu_flush(vdev, vq, elem, iov_used);
> +	}
> +}
> diff --git a/udp_vu.h b/udp_vu.h
> new file mode 100644
> index 000000000000..ba7018d3bf01
> --- /dev/null
> +++ b/udp_vu.h
> @@ -0,0 +1,13 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef UDP_VU_H
> +#define UDP_VU_H
> +
> +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now);
> +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			       uint32_t events, const struct timespec *now);
> +#endif /* UDP_VU_H */
> diff --git a/vhost_user.c b/vhost_user.c
> index 1e302926b8fe..e905f3329f71 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -48,12 +48,13 @@
>  /* vhost-user version we are compatible with */
>  #define VHOST_USER_VERSION 1
>  
> +static struct vu_dev vdev_storage;

I see that struct vu_dev is 1564 bytes (on x86), but struct ctx is
580320 bytes (because of tcp_ctx and udp_ctx), so I wouldn't see a
problem if you embedded this directly into struct ctx without a pointer.

> +
>  /**
>   * vu_print_capabilities() - print vhost-user capabilities
>   * 			     this is part of the vhost-user backend
>   * 			     convention.
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_print_capabilities(void)
>  {
>  	info("{");
> @@ -163,9 +164,7 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
>   */
>  static void vu_remove_watch(const struct vu_dev *vdev, int fd)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> -	(void)fd;
> +	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
>  }
>  
>  /**
> @@ -487,6 +486,14 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
>  		}
>  	}
>  
> +	/* As vu_packet_check_range() has no access to the number of
> +	 * memory regions, mark the end of the array with mmap_addr = 0
> +	 */
> +	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
> +	vdev->regions[vdev->nregions].mmap_addr = 0;
> +
> +	tap_sock_update_pool(vdev->regions, 0);
> +
>  	return false;
>  }
>  
> @@ -615,9 +622,12 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
>   */
>  static void vu_set_watch(const struct vu_dev *vdev, int fd)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> -	(void)fd;
> +	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
> +	struct epoll_event ev = { 0 };
> +
> +	ev.data.u64 = ref.u64;
> +	ev.events = EPOLLIN;
> +	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, fd, &ev);
>  }
>  
>  /**
> @@ -829,14 +839,14 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
>   * @c:		execution context
>   * @vdev:	vhost-user device
>   */
> -/* cppcheck-suppress unusedFunction */
> -void vu_init(struct ctx *c, struct vu_dev *vdev)
> +void vu_init(struct ctx *c)
>  {
>  	int i;
>  
> -	vdev->context = c;
> +	c->vdev = &vdev_storage;
> +	c->vdev->context = c;
>  	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> -		vdev->vq[i] = (struct vu_virtq){
> +		c->vdev->vq[i] = (struct vu_virtq){

From previous patch: missing whitespace between ) and {

>  			.call_fd = -1,
>  			.kick_fd = -1,
>  			.err_fd = -1,
> @@ -849,7 +859,6 @@ void vu_init(struct ctx *c, struct vu_dev *vdev)
>   * vu_cleanup() - Reset vhost-user device
>   * @vdev:	vhost-user device
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_cleanup(struct vu_dev *vdev)
>  {
>  	unsigned int i;
> @@ -896,8 +905,7 @@ void vu_cleanup(struct vu_dev *vdev)
>   */
>  static void vu_sock_reset(struct vu_dev *vdev)
>  {
> -	/* Placeholder to add passt related code */
> -	(void)vdev;
> +	tap_sock_reset(vdev->context);
>  }
>  
>  static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
> @@ -925,7 +933,6 @@ static bool (*vu_handle[VHOST_USER_MAX])(struct vu_dev *vdev,
>   * @fd:		vhost-user message socket
>   * @events:	epoll events
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
>  {
>  	struct vhost_user_msg msg = { 0 };
> diff --git a/vhost_user.h b/vhost_user.h
> index 5af349ba58b8..464ba21e962f 100644
> --- a/vhost_user.h
> +++ b/vhost_user.h
> @@ -183,7 +183,6 @@ struct vhost_user_msg {
>   *
>   * Return: true if the virqueue is enabled, false otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  static inline bool vu_queue_enabled(const struct vu_virtq *vq)
>  {
>  	return vq->enable;
> @@ -195,14 +194,13 @@ static inline bool vu_queue_enabled(const struct vu_virtq *vq)
>   *
>   * Return: true if the virqueue is started, false otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  static inline bool vu_queue_started(const struct vu_virtq *vq)
>  {
>  	return vq->started;
>  }
>  
>  void vu_print_capabilities(void);
> -void vu_init(struct ctx *c, struct vu_dev *vdev);
> +void vu_init(struct ctx *c);
>  void vu_cleanup(struct vu_dev *vdev);
>  void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events);
>  #endif /* VHOST_USER_H */
> diff --git a/virtio.c b/virtio.c
> index 380590afbca3..0598ff479858 100644
> --- a/virtio.c
> +++ b/virtio.c
> @@ -328,7 +328,6 @@ static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
>   * @dev:	Vhost-user device
>   * @vq:		Virtqueue
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
>  {
>  	if (!vq->vring.avail)
> @@ -504,7 +503,6 @@ static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned i
>   *
>   * Return: -1 if there is an error, 0 otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
>  {
>  	unsigned int head;
> @@ -565,7 +563,6 @@ void vu_queue_unpop(struct vu_virtq *vq)
>   * @vq:		Virtqueue
>   * @num:	Number of element to unpop
>   */
> -/* cppcheck-suppress unusedFunction */
>  bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
>  {
>  	if (num > vq->inuse)
> @@ -621,7 +618,6 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
>   * @len:	Size of the element
>   * @idx:	Used ring entry index
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
>  		   unsigned int len, unsigned int idx)
>  {
> @@ -645,7 +641,6 @@ static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
>   * @vq:		Virtqueue
>   * @count:	Number of entry to flush
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
>  {
>  	uint16_t old, new;
> diff --git a/vu_common.c b/vu_common.c
> new file mode 100644
> index 000000000000..4977d6af0f92
> --- /dev/null
> +++ b/vu_common.c
> @@ -0,0 +1,385 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * common_vu.c - vhost-user common UDP and TCP functions
> + */
> +
> +#include <unistd.h>
> +#include <sys/uio.h>
> +#include <sys/eventfd.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "tap.h"
> +#include "vhost_user.h"
> +#include "pcap.h"
> +#include "vu_common.h"
> +
> +/**
> + * vu_packet_check_range() - Check if a given memory zone is contained in
> + * 			     a mapped guest memory region
> + * @buf:	Array of the available memory regions
> + * @offset:	Offset of data range in packet descriptor
> + * @size:	Length of desired data range
> + * @start:	Start of the packet descriptor
> + *
> + * Return: 0 if the zone is in a mapped memory region, -1 otherwise
> + */
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start)
> +{
> +	struct vu_dev_region *dev_region;
> +
> +	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
> +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +		char *m = (char *)dev_region->mmap_addr;
> +
> +		if (m <= start &&
> +		    start + offset + len <= m + dev_region->mmap_offset +
> +					       dev_region->size)
> +			return 0;
> +	}
> +
> +	return -1;
> +}
> +
> +/**
> + * vu_init_elem() - initialize an array of virtqueue element with 1 iov in each
> + * @elem:	Array of virtqueue element to initialize
> + * @iov:	Array of iovec to assign to virtqueue element
> + * @elem_cnt:	Number of virtqueue element
> + */
> +void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov, int elem_cnt)
> +{
> +	int i;
> +
> +	for (i = 0; i < elem_cnt; i++) {
> +		elem[i].out_num = 0;
> +		elem[i].out_sg = NULL;
> +		elem[i].in_num = 1;
> +		elem[i].in_sg = &iov[i];
> +	}
> +}
> +
> +/**
> + * vu_collect_one_frame() - collect virtio buffers from a given virtqueue for
> + *			    one frame
> + * @vdev:		vhost-user device
> + * @vq:			virtqueue to collect from
> + * @elem:		Array of virtqueue element

elements

> + * 			each element must be initialized with one iovec entry
> + * 			in the in_sg array.
> + * @max_elem:		Number of virtqueue element in the array

elements (?)

> + * @size:		Maximum size of the data in the frame
> + * @hdrlen:		Size of the frame header

Return: count of usable elements from virtqueue (?)

> + */
> +int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq,
> +			 struct vu_virtq_element *elem, int max_elem,
> +			 size_t size, size_t hdrlen, size_t *frame_size)
> +{
> +	size_t current_size = 0;
> +	struct iovec *iov;
> +	int elem_cnt = 0;
> +	int ret;
> +
> +	/* header is at least virtio_net_hdr_mrg_rxbuf */
> +	hdrlen += sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +
> +	/* collect first (or unique) element, it will contain header */

s/unique/single/

> +	ret = vu_queue_pop(vdev, vq, &elem[0]);
> +	if (ret < 0)
> +		goto out;
> +
> +	if (elem[0].in_num < 1) {
> +		warn("virtio-net receive queue contains no in buffers");
> +		vu_queue_detach_element(vq);
> +		goto out;
> +	}
> +
> +	iov = &elem[elem_cnt].in_sg[0];
> +
> +	ASSERT(iov->iov_len >= hdrlen);
> +
> +	/* add space for header */
> +	iov->iov_base = (char *)iov->iov_base + hdrlen;
> +	iov->iov_len -= hdrlen;
> +
> +	if (iov->iov_len > size)
> +		iov->iov_len = size;
> +
> +	elem_cnt++;
> +	current_size = iov->iov_len;
> +
> +	if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		goto out;
> +
> +	/* if possible, coalesce buffers to reach size */
> +	while (current_size < size && elem_cnt < max_elem) {
> +
> +		ret = vu_queue_pop(vdev, vq, &elem[elem_cnt]);
> +		if (ret < 0)
> +			break;
> +
> +		if (elem[elem_cnt].in_num < 1) {
> +			warn("virtio-net receive queue contains no in buffers");
> +			vu_queue_detach_element(vq);
> +			break;
> +		}
> +
> +		iov = &elem[elem_cnt].in_sg[0];
> +
> +		if (iov->iov_len > size - current_size)
> +			iov->iov_len = size - current_size;
> +
> +		current_size += iov->iov_len;
> +		elem_cnt++;
> +	}
> +
> +out:
> +	if (frame_size)
> +		*frame_size = current_size;
> +
> +	return elem_cnt;
> +}
> +
> +/**
> + * vu_collect() - collect virtio buffers from a given virtqueue
> + * @vdev:		vhost-user device
> + * @vq:			virtqueue to collect from
> + * @elem:		Array of virtqueue element

elements

> + * 			each element must be initialized with one iovec entry
> + * 			in the in_sg array.
> + * @max_elem:		Number of virtqueue element in the array
> + * @max_frame_size:	Maximum size of the data in the frame
> + * @hdrlen:		Size of the frame header
> + * @size:		Total size of the buffers we need to collect
> + * 			(if size > max_frame_size, we collect several frame)

frames

 * Return: number of available buffers

> + */
> +int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
> +	       struct vu_virtq_element *elem, int max_elem,
> +	       size_t max_frame_size, size_t hdrlen, size_t size)
> +{
> +	int elem_cnt = 0;
> +
> +	while (size > 0 && elem_cnt < max_elem) {
> +		size_t frame_size;
> +		int cnt;
> +
> +		if (max_frame_size > size)
> +			max_frame_size = size;
> +
> +		cnt = vu_collect_one_frame(vdev, vq,
> +					   &elem[elem_cnt], max_elem - elem_cnt,
> +					   max_frame_size, hdrlen, &frame_size);
> +		if (cnt == 0)
> +			break;
> +
> +		size -= frame_size;
> +		elem_cnt += cnt;
> +
> +		if (frame_size < max_frame_size)
> +			break;
> +	}
> +
> +	return elem_cnt;
> +}
> +
> +/**
> + * vu_set_vnethdr() - set virtio-net headers in a given iovec
> + * @vdev:		vhost-user device
> + * @iov:		One iovec to initialize
> + * @num_buffers:	Number of guest buffers of the frame
> + * @hdrlen:		Size of the frame header
> + */
> +void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov,
> +		    int num_buffers, size_t hdrlen)
> +{
> +	struct virtio_net_hdr_mrg_rxbuf *vnethdr;
> +
> +	/* header is at least virtio_net_hdr_mrg_rxbuf */
> +	hdrlen += sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +
> +	/* NOLINTNEXTLINE(clang-analyzer-core.UndefinedBinaryOperatorResult) */
> +	iov->iov_base = (char *)iov->iov_base - hdrlen;
> +	iov->iov_len += hdrlen;
> +
> +	vnethdr = iov->iov_base;
> +	vnethdr->hdr = VU_HEADER;
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		vnethdr->num_buffers = htole16(num_buffers);
> +}
> +
> +/**
> + * vu_flush() - flush all the collected buffers to the vhost-user interface
> + * @vdev:	vhost-user device
> + * @vq:		vhost-user virtqueue
> + * @elem:	virtqueue element array to send back to the virqueue

virtqueue

> + * @iov_used:	Length of the array

It's elem_cnt now.

> + */
> +void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> +	      struct vu_virtq_element *elem, int elem_cnt)
> +{
> +	int i;
> +
> +	for (i = 0; i < elem_cnt; i++)
> +		vu_queue_fill(vq, &elem[i], elem[i].in_sg[0].iov_len, i);
> +
> +	vu_queue_flush(vq, elem_cnt);
> +	vu_queue_notify(vdev, vq);
> +}
> +
> +/**
> + * vu_handle_tx() - Receive data from the TX virtqueue
> + * @vdev:	vhost-user device
> + * @index:	index of the virtqueue
> + * @now:	Current timestamp
> + */
> +static void vu_handle_tx(struct vu_dev *vdev, int index,
> +			 const struct timespec *now)
> +{
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> +	struct vu_virtq *vq = &vdev->vq[index];
> +	int hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	int out_sg_count;
> +	int count;
> +
> +	if (!VHOST_USER_IS_QUEUE_TX(index)) {
> +		debug("vhost-user: index %d is not a TX queue", index);
> +		return;
> +	}
> +
> +	tap_flush_pools();
> +
> +	count = 0;
> +	out_sg_count = 0;
> +	while (count < VIRTQUEUE_MAX_SIZE) {
> +		int ret;
> +
> +		elem[count].out_num = 1;
> +		elem[count].out_sg = &out_sg[out_sg_count];
> +		elem[count].in_num = 0;
> +		elem[count].in_sg = NULL;
> +		ret = vu_queue_pop(vdev, vq, &elem[count]);
> +		if (ret < 0)
> +			break;
> +		out_sg_count += elem[count].out_num;
> +
> +		if (elem[count].out_num < 1) {
> +			debug("virtio-net header not in first element");
> +			break;
> +		}
> +		ASSERT(elem[count].out_num == 1);
> +
> +		tap_add_packet(vdev->context,
> +			       elem[count].out_sg[0].iov_len - hdrlen,
> +			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
> +		count++;
> +	}
> +	tap_handler(vdev->context, now);
> +
> +	if (count) {
> +		int i;
> +
> +		for (i = 0; i < count; i++)
> +			vu_queue_fill(vq, &elem[i], 0, i);
> +		vu_queue_flush(vq, count);
> +		vu_queue_notify(vdev, vq);
> +	}
> +}
> +
> +/**
> + * vu_kick_cb() - Called on a kick event to start to receive data
> + * @vdev:	vhost-user device
> + * @ref:	epoll reference information
> + * @now:	Current timestamp
> + */
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now)
> +{
> +	eventfd_t kick_data;
> +	ssize_t rc;
> +	int idx;
> +
> +	for (idx = 0; idx < VHOST_USER_MAX_QUEUES; idx++) {
> +		if (vdev->vq[idx].kick_fd == ref.fd)
> +			break;
> +	}
> +
> +	if (idx == VHOST_USER_MAX_QUEUES)
> +		return;
> +
> +	rc = eventfd_read(ref.fd, &kick_data);
> +	if (rc == -1)
> +		die_perror("vhost-user kick eventfd_read()");
> +
> +	debug("vhost-user: ot kick_data: %016"PRIx64" idx:%d",

"ot"?

Missing space after "idx:".

> +	      kick_data, idx);
> +	if (VHOST_USER_IS_QUEUE_TX(idx))
> +		vu_handle_tx(vdev, idx, now);
> +}
> +
> +/**
> + * vu_send_single() - Send a buffer to the front-end using the RX virtqueue
> + * @c:		execution context
> + * @buf:	address of the buffer
> + * @size:	size of the buffer
> + *
> + * Return: number of bytes sent, -1 if there is an error

I would say it returns 0 on error.

> + */
> +int vu_send_single(const struct ctx *c, const void *buf, size_t size)
> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +	size_t total;
> +	int elem_cnt, max_elem;
> +	int i;
> +
> +	debug("vu_send_single size %zu", size);
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		err("Got packet, but no available descriptors on RX virtq.");
> +		return 0;
> +	}
> +
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		max_elem = VIRTQUEUE_MAX_SIZE;
> +	else
> +		max_elem = 1;
> +
> +	vu_init_elem(elem, in_sg, max_elem);
> +
> +	elem_cnt = vu_collect_one_frame(vdev, vq, elem, max_elem, size,
> +					0, &total);
> +	if (total < size) {
> +		debug("vu_send_single: no space to send the data "
> +		      "elem_cnt %d size %zd", elem_cnt, total);
> +		goto err;
> +	}
> +
> +	vu_set_vnethdr(vdev, in_sg, elem_cnt, 0);
> +
> +	/* copy data from the buffer to the iovec */
> +	iov_from_buf(in_sg, elem_cnt, sizeof(struct virtio_net_hdr_mrg_rxbuf),
> +		     buf, size);
> +
> +	if (*c->pcap) {
> +		pcap_iov(in_sg, elem_cnt,
> +			 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +	}
> +
> +	vu_flush(vdev, vq, elem, elem_cnt);
> +
> +	debug("vhost-user sent %zu", total);
> +
> +	return total;
> +err:
> +	for (i = 0; i < elem_cnt; i++)
> +		vu_queue_detach_element(vq);
> +
> +	return 0;
> +}
> diff --git a/vu_common.h b/vu_common.h
> new file mode 100644
> index 000000000000..1d6048060059
> --- /dev/null
> +++ b/vu_common.h
> @@ -0,0 +1,47 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * vhost-user common UDP and TCP functions
> + */
> +
> +#ifndef VU_COMMON_H
> +#define VU_COMMON_H
> +#include <linux/virtio_net.h>
> +
> +static inline void *vu_eth(void *base)
> +{
> +	return ((char *)base + sizeof(struct virtio_net_hdr_mrg_rxbuf));
> +}
> +
> +static inline void *vu_ip(void *base)
> +{
> +	return (struct ethhdr *)vu_eth(base) + 1;
> +}
> +
> +static inline void *vu_payloadv4(void *base)
> +{
> +	return (struct iphdr *)vu_ip(base) + 1;
> +}
> +
> +static inline void *vu_payloadv6(void *base)
> +{
> +	return (struct ipv6hdr *)vu_ip(base) + 1;
> +}
> +
> +void vu_init_elem(struct vu_virtq_element *elem, struct iovec *iov,
> +		  int elem_cnt);
> +int vu_collect_one_frame(struct vu_dev *vdev, struct vu_virtq *vq,
> +			 struct vu_virtq_element *elem, int max_elem,
> +			 size_t size, size_t hdrlen, size_t *frame_size);
> +int vu_collect(struct vu_dev *vdev, struct vu_virtq *vq,
> +	       struct vu_virtq_element *elem, int max_elem,
> +	       size_t max_frame_size, size_t hdrlen, size_t size);
> +void vu_set_vnethdr(const struct vu_dev *vdev, struct iovec *iov,
> +                    int num_buffers, size_t hdrlen);
> +void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> +	      struct vu_virtq_element *elem, int elem_cnt);
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now);
> +int vu_send_single(const struct ctx *c, const void *buf, size_t size);
> +#endif /* VU_COMMON_H */

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17  0:10   ` Stefano Brivio
@ 2024-10-17  7:28     ` Laurent Vivier
  2024-10-17  8:33       ` Stefano Brivio
  2024-11-14 10:20     ` Laurent Vivier
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-10-17  7:28 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 17/10/2024 02:10, Stefano Brivio wrote:
>> --- a/vhost_user.c
>> +++ b/vhost_user.c
>> @@ -48,12 +48,13 @@
>>   /* vhost-user version we are compatible with */
>>   #define VHOST_USER_VERSION 1
>>   
>> +static struct vu_dev vdev_storage;
> I see that struct vu_dev is 1564 bytes (on x86), but struct ctx is
> 580320 bytes (because of tcp_ctx and udp_ctx), so I wouldn't see a
> problem if you embedded this directly into struct ctx without a pointer.

I did in the past, but the problem here is most of the time ctx is passed to the function 
as a "const" and vdev (for the virqueues) must be modified (so it's not a const anymore).
But I can try again if you think "const" is not important.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17  7:28     ` Laurent Vivier
@ 2024-10-17  8:33       ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-17  8:33 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 17 Oct 2024 09:28:21 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/10/2024 02:10, Stefano Brivio wrote:
> >> --- a/vhost_user.c
> >> +++ b/vhost_user.c
> >> @@ -48,12 +48,13 @@
> >>   /* vhost-user version we are compatible with */
> >>   #define VHOST_USER_VERSION 1
> >>   
> >> +static struct vu_dev vdev_storage;  
> > I see that struct vu_dev is 1564 bytes (on x86), but struct ctx is
> > 580320 bytes (because of tcp_ctx and udp_ctx), so I wouldn't see a
> > problem if you embedded this directly into struct ctx without a pointer.  
> 
> I did in the past, but the problem here is most of the time ctx is passed to the function 
> as a "const" and vdev (for the virqueues) must be modified (so it's not a const anymore).
> But I can try again if you think "const" is not important.

Ah, right, sorry, I just assumed you did that for a reason of size.

No, I think const is a nice thing to have for the actual runtime
configuration, so I guess it's better to keep it like you did.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17  0:10       ` Stefano Brivio
@ 2024-10-17 11:25         ` Stefano Brivio
  2024-10-17 11:54           ` Laurent Vivier
  2024-10-17 17:18           ` Laurent Vivier
  2024-10-22 12:59         ` Laurent Vivier
  1 sibling, 2 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-17 11:25 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev

On Thu, 17 Oct 2024 02:10:31 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Wed, 16 Oct 2024 11:41:34 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:  
> > > [Still partial review]    
> > [snip]  
> > > > +	if (peek_offset_cap)
> > > > +		already_sent = 0;
> > > > +
> > > > +	iov_vu[0].iov_base = tcp_buf_discard;
> > > > +	iov_vu[0].iov_len = already_sent;    
> > > 
> > > I think I had a similar comment to a previous revision. Now, I haven't
> > > tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
> > > I think this should eventually follow the same logic as the (updated)
> > > tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
> > > (!peek_offset_cap).
> > > 
> > > It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
> > > starting from 1, for simplicity. But I'm not sure if it's safe to pass a
> > > zero iov_len if (peek_offset_cap).    
> >   
> > > I'll test that (unless you already did) -- if it works, we can fix this
> > > up later as well.    
> > 
> > I believe I tested it at some point, and I think we're already using
> > it somewhere.  
> 
> I tested it again just to be sure on a recent net.git kernel: sometimes
> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
> (using loopback address): big transfer" test instead.
> 
> I can reproduce at least one of the two issues consistently (tests
> stopped 5 times out of 5).
> 
> The socat client completes the transfer, the server is still waiting
> for something. I haven't taken captures yet or tried to re-send from
> the client.

...Laurent, let me know if I should dig into this any further.

For reference, the kernel commit introducing SO_PEEK_OFF support for TCP
on IPv6 is be9a4fb831b8 ("tcp: add SO_PEEK_OFF socket option tor
TCPv6"). Without that commit, passt won't set peek_offset_cap.

It was added in 6.11-rc5, so it's part of kernel-6.11.3-200.fc40 (latest
stable kernel) for Fedora 40. passt will print "SO_PEEK_OFF supported"
if you run it with -d -f.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17 11:25         ` Stefano Brivio
@ 2024-10-17 11:54           ` Laurent Vivier
  2024-10-17 17:18           ` Laurent Vivier
  1 sibling, 0 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-17 11:54 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: David Gibson, passt-dev

On 17/10/2024 13:25, Stefano Brivio wrote:
> On Thu, 17 Oct 2024 02:10:31 +0200
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
>> On Wed, 16 Oct 2024 11:41:34 +1100
>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>>> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:
>>>> [Still partial review]
>>> [snip]
>>>>> +	if (peek_offset_cap)
>>>>> +		already_sent = 0;
>>>>> +
>>>>> +	iov_vu[0].iov_base = tcp_buf_discard;
>>>>> +	iov_vu[0].iov_len = already_sent;
>>>>
>>>> I think I had a similar comment to a previous revision. Now, I haven't
>>>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
>>>> I think this should eventually follow the same logic as the (updated)
>>>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
>>>> (!peek_offset_cap).
>>>>
>>>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
>>>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
>>>> zero iov_len if (peek_offset_cap).
>>>    
>>>> I'll test that (unless you already did) -- if it works, we can fix this
>>>> up later as well.
>>>
>>> I believe I tested it at some point, and I think we're already using
>>> it somewhere.
>>
>> I tested it again just to be sure on a recent net.git kernel: sometimes
>> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
>> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
>> (using loopback address): big transfer" test instead.
>>
>> I can reproduce at least one of the two issues consistently (tests
>> stopped 5 times out of 5).
>>
>> The socat client completes the transfer, the server is still waiting
>> for something. I haven't taken captures yet or tried to re-send from
>> the client.
> 
> ...Laurent, let me know if I should dig into this any further.
> 
> For reference, the kernel commit introducing SO_PEEK_OFF support for TCP
> on IPv6 is be9a4fb831b8 ("tcp: add SO_PEEK_OFF socket option tor
> TCPv6"). Without that commit, passt won't set peek_offset_cap.
> 
> It was added in 6.11-rc5, so it's part of kernel-6.11.3-200.fc40 (latest
> stable kernel) for Fedora 40. passt will print "SO_PEEK_OFF supported"
> if you run it with -d -f.
> 

I have this kernel on my laptop... but I need to reboot.
I will dig into this.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17 11:25         ` Stefano Brivio
  2024-10-17 11:54           ` Laurent Vivier
@ 2024-10-17 17:18           ` Laurent Vivier
  2024-10-17 17:25             ` Laurent Vivier
  2024-10-17 17:33             ` Stefano Brivio
  1 sibling, 2 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-17 17:18 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: David Gibson, passt-dev

On 17/10/2024 13:25, Stefano Brivio wrote:
> On Thu, 17 Oct 2024 02:10:31 +0200
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
>> On Wed, 16 Oct 2024 11:41:34 +1100
>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>>> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:
>>>> [Still partial review]
>>> [snip]
>>>>> +	if (peek_offset_cap)
>>>>> +		already_sent = 0;
>>>>> +
>>>>> +	iov_vu[0].iov_base = tcp_buf_discard;
>>>>> +	iov_vu[0].iov_len = already_sent;
>>>>
>>>> I think I had a similar comment to a previous revision. Now, I haven't
>>>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
>>>> I think this should eventually follow the same logic as the (updated)
>>>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
>>>> (!peek_offset_cap).
>>>>
>>>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
>>>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
>>>> zero iov_len if (peek_offset_cap).
>>>    
>>>> I'll test that (unless you already did) -- if it works, we can fix this
>>>> up later as well.
>>>
>>> I believe I tested it at some point, and I think we're already using
>>> it somewhere.
>>
>> I tested it again just to be sure on a recent net.git kernel: sometimes
>> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
>> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
>> (using loopback address): big transfer" test instead.
>>
>> I can reproduce at least one of the two issues consistently (tests
>> stopped 5 times out of 5).
>>
>> The socat client completes the transfer, the server is still waiting
>> for something. I haven't taken captures yet or tried to re-send from
>> the client.
> 
> ...Laurent, let me know if I should dig into this any further.
> 
> For reference, the kernel commit introducing SO_PEEK_OFF support for TCP
> on IPv6 is be9a4fb831b8 ("tcp: add SO_PEEK_OFF socket option tor
> TCPv6"). Without that commit, passt won't set peek_offset_cap.
> 
> It was added in 6.11-rc5, so it's part of kernel-6.11.3-200.fc40 (latest
> stable kernel) for Fedora 40. passt will print "SO_PEEK_OFF supported"
> if you run it with -d -f.
> 

I have kernel 6.11.3-200.fc40.x86_64 but the message is "SO_PEEK_OFF not supported".

Any idea?

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17 17:18           ` Laurent Vivier
@ 2024-10-17 17:25             ` Laurent Vivier
  2024-10-17 17:33             ` Stefano Brivio
  1 sibling, 0 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-17 17:25 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: David Gibson, passt-dev

On 17/10/2024 19:18, Laurent Vivier wrote:
> On 17/10/2024 13:25, Stefano Brivio wrote:
>> On Thu, 17 Oct 2024 02:10:31 +0200
>> Stefano Brivio <sbrivio@redhat.com> wrote:
>>
>>> On Wed, 16 Oct 2024 11:41:34 +1100
>>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>>
>>>> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:
>>>>> [Still partial review]
>>>> [snip]
>>>>>> +    if (peek_offset_cap)
>>>>>> +        already_sent = 0;
>>>>>> +
>>>>>> +    iov_vu[0].iov_base = tcp_buf_discard;
>>>>>> +    iov_vu[0].iov_len = already_sent;
>>>>>
>>>>> I think I had a similar comment to a previous revision. Now, I haven't
>>>>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
>>>>> I think this should eventually follow the same logic as the (updated)
>>>>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
>>>>> (!peek_offset_cap).
>>>>>
>>>>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
>>>>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
>>>>> zero iov_len if (peek_offset_cap).
>>>>> I'll test that (unless you already did) -- if it works, we can fix this
>>>>> up later as well.
>>>>
>>>> I believe I tested it at some point, and I think we're already using
>>>> it somewhere.
>>>
>>> I tested it again just to be sure on a recent net.git kernel: sometimes
>>> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
>>> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
>>> (using loopback address): big transfer" test instead.
>>>
>>> I can reproduce at least one of the two issues consistently (tests
>>> stopped 5 times out of 5).
>>>
>>> The socat client completes the transfer, the server is still waiting
>>> for something. I haven't taken captures yet or tried to re-send from
>>> the client.
>>
>> ...Laurent, let me know if I should dig into this any further.
>>
>> For reference, the kernel commit introducing SO_PEEK_OFF support for TCP
>> on IPv6 is be9a4fb831b8 ("tcp: add SO_PEEK_OFF socket option tor
>> TCPv6"). Without that commit, passt won't set peek_offset_cap.
>>
>> It was added in 6.11-rc5, so it's part of kernel-6.11.3-200.fc40 (latest
>> stable kernel) for Fedora 40. passt will print "SO_PEEK_OFF supported"
>> if you run it with -d -f.
>>
> 
> I have kernel 6.11.3-200.fc40.x86_64 but the message is "SO_PEEK_OFF not supported".

strace gives me:

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 73
setsockopt(73, SOL_SOCKET, SO_PEEK_OFF, [0], 4) = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 73
setsockopt(73, SOL_SOCKET, SO_PEEK_OFF, [0], 4) = -1 EOPNOTSUPP (Operation not supported)

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17 17:18           ` Laurent Vivier
  2024-10-17 17:25             ` Laurent Vivier
@ 2024-10-17 17:33             ` Stefano Brivio
  2024-10-17 21:21               ` Stefano Brivio
  1 sibling, 1 reply; 50+ messages in thread
From: Stefano Brivio @ 2024-10-17 17:33 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev

On Thu, 17 Oct 2024 19:18:57 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/10/2024 13:25, Stefano Brivio wrote:
> > On Thu, 17 Oct 2024 02:10:31 +0200
> > Stefano Brivio <sbrivio@redhat.com> wrote:
> >   
> >> On Wed, 16 Oct 2024 11:41:34 +1100
> >> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>  
> >>> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:  
> >>>> [Still partial review]  
> >>> [snip]  
> >>>>> +	if (peek_offset_cap)
> >>>>> +		already_sent = 0;
> >>>>> +
> >>>>> +	iov_vu[0].iov_base = tcp_buf_discard;
> >>>>> +	iov_vu[0].iov_len = already_sent;  
> >>>>
> >>>> I think I had a similar comment to a previous revision. Now, I haven't
> >>>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
> >>>> I think this should eventually follow the same logic as the (updated)
> >>>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
> >>>> (!peek_offset_cap).
> >>>>
> >>>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
> >>>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
> >>>> zero iov_len if (peek_offset_cap).  
> >>>      
> >>>> I'll test that (unless you already did) -- if it works, we can fix this
> >>>> up later as well.  
> >>>
> >>> I believe I tested it at some point, and I think we're already using
> >>> it somewhere.  
> >>
> >> I tested it again just to be sure on a recent net.git kernel: sometimes
> >> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
> >> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
> >> (using loopback address): big transfer" test instead.
> >>
> >> I can reproduce at least one of the two issues consistently (tests
> >> stopped 5 times out of 5).
> >>
> >> The socat client completes the transfer, the server is still waiting
> >> for something. I haven't taken captures yet or tried to re-send from
> >> the client.  
> > 
> > ...Laurent, let me know if I should dig into this any further.
> > 
> > For reference, the kernel commit introducing SO_PEEK_OFF support for TCP
> > on IPv6 is be9a4fb831b8 ("tcp: add SO_PEEK_OFF socket option tor
> > TCPv6"). Without that commit, passt won't set peek_offset_cap.
> > 
> > It was added in 6.11-rc5, so it's part of kernel-6.11.3-200.fc40 (latest
> > stable kernel) for Fedora 40. passt will print "SO_PEEK_OFF supported"
> > if you run it with -d -f.
> >   
> 
> I have kernel 6.11.3-200.fc40.x86_64 but the message is "SO_PEEK_OFF not supported".
> 
> Any idea?

Grr, sorry, I used 'git describe' wrong. That commit will be in 6.12
(not released yet), it's not in 6.11.

For testing, you can force peek_offset_cap = true in tcp.c, as long as
you don't use IPv6 (you can pass "-4" to passt just to be sure) it's
fine.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17 17:33             ` Stefano Brivio
@ 2024-10-17 21:21               ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-17 21:21 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev

On Thu, 17 Oct 2024 19:33:38 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Thu, 17 Oct 2024 19:18:57 +0200
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
> > On 17/10/2024 13:25, Stefano Brivio wrote:  
> > > On Thu, 17 Oct 2024 02:10:31 +0200
> > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > >     
> > >> On Wed, 16 Oct 2024 11:41:34 +1100
> > >> David Gibson <david@gibson.dropbear.id.au> wrote:
> > >>    
> > >>> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:    
> > >>>> [Still partial review]    
> > >>> [snip]    
> > >>>>> +	if (peek_offset_cap)
> > >>>>> +		already_sent = 0;
> > >>>>> +
> > >>>>> +	iov_vu[0].iov_base = tcp_buf_discard;
> > >>>>> +	iov_vu[0].iov_len = already_sent;    
> > >>>>
> > >>>> I think I had a similar comment to a previous revision. Now, I haven't
> > >>>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
> > >>>> I think this should eventually follow the same logic as the (updated)
> > >>>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
> > >>>> (!peek_offset_cap).
> > >>>>
> > >>>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
> > >>>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
> > >>>> zero iov_len if (peek_offset_cap).    
> > >>>        
> > >>>> I'll test that (unless you already did) -- if it works, we can fix this
> > >>>> up later as well.    
> > >>>
> > >>> I believe I tested it at some point, and I think we're already using
> > >>> it somewhere.    
> > >>
> > >> I tested it again just to be sure on a recent net.git kernel: sometimes
> > >> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
> > >> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
> > >> (using loopback address): big transfer" test instead.
> > >>
> > >> I can reproduce at least one of the two issues consistently (tests
> > >> stopped 5 times out of 5).
> > >>
> > >> The socat client completes the transfer, the server is still waiting
> > >> for something. I haven't taken captures yet or tried to re-send from
> > >> the client.    
> > > 
> > > ...Laurent, let me know if I should dig into this any further.
> > > 
> > > For reference, the kernel commit introducing SO_PEEK_OFF support for TCP
> > > on IPv6 is be9a4fb831b8 ("tcp: add SO_PEEK_OFF socket option tor
> > > TCPv6"). Without that commit, passt won't set peek_offset_cap.
> > > 
> > > It was added in 6.11-rc5, so it's part of kernel-6.11.3-200.fc40 (latest
> > > stable kernel) for Fedora 40. passt will print "SO_PEEK_OFF supported"
> > > if you run it with -d -f.
> > >     
> > 
> > I have kernel 6.11.3-200.fc40.x86_64 but the message is "SO_PEEK_OFF not supported".
> > 
> > Any idea?  
> 
> Grr, sorry, I used 'git describe' wrong. That commit will be in 6.12
> (not released yet), it's not in 6.11.
> 
> For testing, you can force peek_offset_cap = true in tcp.c, as long as
> you don't use IPv6 (you can pass "-4" to passt just to be sure) it's
> fine.

Or... I didn't know about this until now:

  https://fedoraproject.org/wiki/RawhideKernelNodebug

as well as:

  https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories
  https://docs.fedoraproject.org/en-US/releases/rawhide/#_questions_and_answers

and kernel-6.12.0-0.rc3.20241015giteca631b8fe80.32.fc42 surely includes
that commit:

  https://bodhi.fedoraproject.org/updates/FEDORA-2024-4f8beaeee0

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17  0:10       ` Stefano Brivio
  2024-10-17 11:25         ` Stefano Brivio
@ 2024-10-22 12:59         ` Laurent Vivier
  2024-10-22 13:19           ` Stefano Brivio
  2024-10-23 15:27           ` Laurent Vivier
  1 sibling, 2 replies; 50+ messages in thread
From: Laurent Vivier @ 2024-10-22 12:59 UTC (permalink / raw)
  To: Stefano Brivio, David Gibson; +Cc: passt-dev

On 17/10/2024 02:10, Stefano Brivio wrote:
> On Wed, 16 Oct 2024 11:41:34 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
>> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:
>>> [Still partial review]
>> [snip]
>>>> +	if (peek_offset_cap)
>>>> +		already_sent = 0;
>>>> +
>>>> +	iov_vu[0].iov_base = tcp_buf_discard;
>>>> +	iov_vu[0].iov_len = already_sent;
>>>
>>> I think I had a similar comment to a previous revision. Now, I haven't
>>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
>>> I think this should eventually follow the same logic as the (updated)
>>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
>>> (!peek_offset_cap).
>>>
>>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
>>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
>>> zero iov_len if (peek_offset_cap).
>>
>>> I'll test that (unless you already did) -- if it works, we can fix this
>>> up later as well.
>>
>> I believe I tested it at some point, and I think we're already using
>> it somewhere.
> 
> I tested it again just to be sure on a recent net.git kernel: sometimes
> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
> (using loopback address): big transfer" test instead.
> 
> I can reproduce at least one of the two issues consistently (tests
> stopped 5 times out of 5).
> 
> The socat client completes the transfer, the server is still waiting
> for something. I haven't taken captures yet or tried to re-send from
> the client.
> 
> It all works (consistently) with an older kernel without support for
> SO_PEEK_OFF on TCP, but also on this kernel if I force peek_offset_cap
> to false in tcp_init().
> 

I have a fix for that but there is an error I don't understand:
when I run twice the test, the second time I have:

guest:
# socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
# socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
2024/10/22 08:51:58 socat[1485] E bind(5, {AF=2 0.0.0.0:10001}, 16): Address already in use

host:
$ socat -u OPEN:test/big.bin TCP4:127.0.0.1:10001

If I wait a little it can work again several times and fails again.

Any idea?

The patch is:
diff --git a/tcp_vu.c b/tcp_vu.c
index 78884c673215..83e40fb07a03 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -379,6 +379,10 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
                            conn->seq_ack_from_tap, conn->seq_to_tap);
                 conn->seq_to_tap = conn->seq_ack_from_tap;
                 already_sent = 0;
+               if (tcp_set_peek_offset(conn->sock, 0)) {
+                       tcp_rst(c, conn);
+                       return -1;
+               }
         }

         if (!wnd_scaled || already_sent >= wnd_scaled) {
@@ -389,14 +393,13 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn 
*conn)

         /* Set up buffer descriptors we'll fill completely and partially. */

-       fillsize = wnd_scaled;
+       fillsize = wnd_scaled - already_sent;

         if (peek_offset_cap)
                 already_sent = 0;

         iov_vu[0].iov_base = tcp_buf_discard;
         iov_vu[0].iov_len = already_sent;
-       fillsize -= already_sent;

         /* collect the buffers from vhost-user and fill them with the
          * data from the socket



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-22 12:59         ` Laurent Vivier
@ 2024-10-22 13:19           ` Stefano Brivio
  2024-10-22 18:19             ` Stefano Brivio
  2024-10-23 15:27           ` Laurent Vivier
  1 sibling, 1 reply; 50+ messages in thread
From: Stefano Brivio @ 2024-10-22 13:19 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev

On Tue, 22 Oct 2024 14:59:19 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/10/2024 02:10, Stefano Brivio wrote:
> > On Wed, 16 Oct 2024 11:41:34 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> >> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:  
> >>> [Still partial review]  
> >> [snip]  
> >>>> +	if (peek_offset_cap)
> >>>> +		already_sent = 0;
> >>>> +
> >>>> +	iov_vu[0].iov_base = tcp_buf_discard;
> >>>> +	iov_vu[0].iov_len = already_sent;  
> >>>
> >>> I think I had a similar comment to a previous revision. Now, I haven't
> >>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
> >>> I think this should eventually follow the same logic as the (updated)
> >>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
> >>> (!peek_offset_cap).
> >>>
> >>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
> >>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
> >>> zero iov_len if (peek_offset_cap).  
> >>  
> >>> I'll test that (unless you already did) -- if it works, we can fix this
> >>> up later as well.  
> >>
> >> I believe I tested it at some point, and I think we're already using
> >> it somewhere.  
> > 
> > I tested it again just to be sure on a recent net.git kernel: sometimes
> > the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
> > transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
> > (using loopback address): big transfer" test instead.
> > 
> > I can reproduce at least one of the two issues consistently (tests
> > stopped 5 times out of 5).
> > 
> > The socat client completes the transfer, the server is still waiting
> > for something. I haven't taken captures yet or tried to re-send from
> > the client.
> > 
> > It all works (consistently) with an older kernel without support for
> > SO_PEEK_OFF on TCP, but also on this kernel if I force peek_offset_cap
> > to false in tcp_init().
> >   
> 
> I have a fix for that but there is an error I don't understand:
> when I run twice the test, the second time I have:
> 
> guest:
> # socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
> # socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
> 2024/10/22 08:51:58 socat[1485] E bind(5, {AF=2 0.0.0.0:10001}, 16): Address already in use
> 
> host:
> $ socat -u OPEN:test/big.bin TCP4:127.0.0.1:10001
> 
> If I wait a little it can work again several times and fails again.
> 
> Any idea?

I guess the connection from the first test is not closed properly, so
it's still in TIME_WAIT or similar. Given the topic, I can imagine that
something goes wrong as we check:

        if (ack && conn->events & TAP_FIN_SENT &&
            conn->seq_ack_from_tap == conn->seq_to_tap)
                conn_event(c, conn, TAP_FIN_ACKED);

in tcp_data_from_tap(). Or something around FIN segments anyway.

You can give socat the 'reuseaddr' option, say:

  socat -u TCP4-LISTEN:10001,reuseaddr OPEN:test_big.bin,create,trunc

to see if that's actually the case.

> The patch is:
> 
> [...]

I can try that in a bit.

From --debug (or --trace) output the issue might be obvious, by the
way (you can enable that in the test suite with DEBUG=1). I guess
you'll see FIN from one side and not from the other one.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-22 13:19           ` Stefano Brivio
@ 2024-10-22 18:19             ` Stefano Brivio
  2024-10-22 18:22               ` Stefano Brivio
  0 siblings, 1 reply; 50+ messages in thread
From: Stefano Brivio @ 2024-10-22 18:19 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev

On Tue, 22 Oct 2024 15:19:53 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Tue, 22 Oct 2024 14:59:19 +0200
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
> > On 17/10/2024 02:10, Stefano Brivio wrote:  
> > > On Wed, 16 Oct 2024 11:41:34 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >     
> > >> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:    
> > >>> [Still partial review]    
> > >> [snip]    
> > >>>> +	if (peek_offset_cap)
> > >>>> +		already_sent = 0;
> > >>>> +
> > >>>> +	iov_vu[0].iov_base = tcp_buf_discard;
> > >>>> +	iov_vu[0].iov_len = already_sent;    
> > >>>
> > >>> I think I had a similar comment to a previous revision. Now, I haven't
> > >>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
> > >>> I think this should eventually follow the same logic as the (updated)
> > >>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
> > >>> (!peek_offset_cap).
> > >>>
> > >>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
> > >>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
> > >>> zero iov_len if (peek_offset_cap).    
> > >>    
> > >>> I'll test that (unless you already did) -- if it works, we can fix this
> > >>> up later as well.    
> > >>
> > >> I believe I tested it at some point, and I think we're already using
> > >> it somewhere.    
> > > 
> > > I tested it again just to be sure on a recent net.git kernel: sometimes
> > > the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
> > > transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
> > > (using loopback address): big transfer" test instead.
> > > 
> > > I can reproduce at least one of the two issues consistently (tests
> > > stopped 5 times out of 5).
> > > 
> > > The socat client completes the transfer, the server is still waiting
> > > for something. I haven't taken captures yet or tried to re-send from
> > > the client.
> > > 
> > > It all works (consistently) with an older kernel without support for
> > > SO_PEEK_OFF on TCP, but also on this kernel if I force peek_offset_cap
> > > to false in tcp_init().
> > >     
> > 
> > I have a fix for that but there is an error I don't understand:
> > when I run twice the test, the second time I have:
> > 
> > guest:
> > # socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
> > # socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
> > 2024/10/22 08:51:58 socat[1485] E bind(5, {AF=2 0.0.0.0:10001}, 16): Address already in use
> > 
> > host:
> > $ socat -u OPEN:test/big.bin TCP4:127.0.0.1:10001
> > 
> > If I wait a little it can work again several times and fails again.
> > 
> > Any idea?  
> 
> I guess the connection from the first test is not closed properly, so
> it's still in TIME_WAIT or similar. Given the topic, I can imagine that
> something goes wrong as we check:
> 
>         if (ack && conn->events & TAP_FIN_SENT &&
>             conn->seq_ack_from_tap == conn->seq_to_tap)
>                 conn_event(c, conn, TAP_FIN_ACKED);
> 
> in tcp_data_from_tap(). Or something around FIN segments anyway.
> 
> You can give socat the 'reuseaddr' option, say:
> 
>   socat -u TCP4-LISTEN:10001,reuseaddr OPEN:test_big.bin,create,trunc
> 
> to see if that's actually the case.
> 
> > The patch is:
> > 
> > [...]  
> 
> I can try that in a bit.

...I was able to reproduce that only once, but I didn't have any
debugging option enabled.

I kept retrying with your follow-up patch but I'm back to the same
behaviour as I had without it (client completing transfer, server
hanging). That single time I reproduced it, both client and server
exited, but I couldn't restart the server right away on the same port.

Anyway, looking further into this: I'm using a recent net.git kernel,
with this series on top of current passt's HEAD and this follow-up
patch.

With --trace, I see that the transfer completes, then, once the client
is done, I get:

---
37.6288: Flow 0 (TCP connection): flag at tcp_vu_data_from_sock:417
37.6288: vhost-user: virtqueue can skip notify...
37.6288: Flow 0 (TCP connection): flag at tcp_vu_data_from_sock:477
37.6289: Flow 0 (TCP connection): timer expires in 2.000s
37.6289: passt: epoll event on connected TCP socket 75 (events: 0x00002001)
---

EPOLLRDHUP | EPOLLIN on the socket

---
37.6289: Flow 0 (TCP connection): event at tcp_sock_handler:2339
37.6289: Flow 0 (TCP connection): SOCK_FIN_RCVD: ESTABLISHED -> CLOSE_WAIT
---

FIN segment (inferred) from the socket

---
37.6289: Flow 0 (TCP connection): event at tcp_vu_sock_recv:262
37.6289: Flow 0 (TCP connection): TAP_FIN_SENT: CLOSE_WAIT -> LAST_ACK
---

FIN to the guest

---
37.6645: passt: epoll event on vhost-user kick socket 73 (events: 0x00000001)
37.6646: vhost-user: ot kick_data: 0000000000000001 idx:1
37.6646: tap: protocol 6, 88.198.0.164:6001 -> 88.198.0.161:53136 (1 packet)
37.6646: Flow 0 (TCP connection): packet length 20 from tap
37.6646: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1215
37.6646: Flow 0 (TCP connection): timer expires in 2.000s
37.6647: Flow 0 (TCP connection): flag at tcp_tap_handler:2095
37.6670: passt: epoll event on vhost-user kick socket 73 (events: 0x00000001)
37.6671: vhost-user: ot kick_data: 0000000000000001 idx:1
37.6671: tap: protocol 6, 88.198.0.164:6001 -> 88.198.0.161:53136 (2 packets)
37.6671: Flow 0 (TCP connection): packet length 20 from tap
37.6671: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1215
37.6671: Flow 0 (TCP connection): timer expires in 2.000s
37.6671: Flow 0 (TCP connection): flag at tcp_tap_handler:2095
37.6671: passt: epoll event on vhost-user kick socket 73 (events: 0x00000001)
37.6671: vhost-user: ot kick_data: 0000000000000002 idx:1
37.6672: tap: protocol 6, 88.198.0.164:6001 -> 88.198.0.161:53136 (1 packet)
37.6672: Flow 0 (TCP connection): packet length 20 from tap
37.6672: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1210
37.6672: Flow 0 (TCP connection): ACK_FROM_TAP_DUE dropped
37.6672: Flow 0 (TCP connection): event at tcp_data_from_tap:1922
37.6672: Flow 0 (TCP connection): TAP_FIN_ACKED: LAST_ACK -> LAST_ACK
---

everything ACKed from the guest, including the FIN

---
37.6672: Flow 0 (TCP connection): flag at tcp_tap_handler:2095
---

STALLED flag unset

...and now we're waiting for the guest to send us a FIN, so that we can
"send" it to the socket (shutdown(x, SHUT_WR)) and close the connection
in tap_sock_handler():

        if (conn->events & ESTABLISHED) {
                if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
                        conn_event(c, conn, CLOSED);

...but that never comes.

I also took a capture (attached), in a separate attempt matching this one
(timestamps might look different), which also shows that we're getting all
the packets acknowledged by the guest, but we're never sending a FIN to it.

This seems to be confirmed by stracing socat in the guest. The transfer
ends like this:

read(6, "M6\263%\245\257\205\24\341\316\377\270\306\301\244\17\333\241/E\211/\243g\367\23\216-\346\306\22\356"..., 8192) = 3056
recvfrom(3, 0x7ffeb3145360, 519, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(3, 0x7ffeb3145860, 519, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
write(5, "M6\263%\245\257\205\24\341\316\377\270\306\301\244\17\333\241/E\211/\243g\367\23\216-\346\306\22\356"..., 3056) = 3056
recvfrom(3, 0x7ffeb3145860, 519, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(3, 0x7ffeb31458d0, 519, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
pselect6(7, [6], [5], [], NULL, NULL)   = 1 (out [5])
recvfrom(3, 0x7ffeb31458d0, 519, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(3, 0x7ffeb31458d0, 519, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
pselect6(7, [6], [], [], NULL, NULL

...we should see an event on the socket and a zero-sized recvfrom() if
we sent a FIN.

If we see TAP_FIN_SENT in the logs, it means that tcp_vu_send_flag() is
called, but it doesn't look like it's sending the FIN | ACK flags as we
request in tcp_vu_sock_recv():

			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);

So I added prints in tcp_vu_send_flag():

---
diff --git a/tcp_vu.c b/tcp_vu.c
index fc077c7..234d909 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -99,6 +99,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	int nb_ack;
 	int ret;
 
+	err("%s:%i", __func__, __LINE__);
+
 	hdrlen = tcp_vu_l2_hdrlen(CONN_V6(conn));
 
 	vu_init_elem(elem, iov_vu, 2);
@@ -109,6 +111,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (elem_cnt < 1)
 		return 0;
 
+	err("%s:%i", __func__, __LINE__);
+
 	vu_set_vnethdr(vdev, &iov_vu[0], 1, 0);
 
 	eh = vu_eth(iov_vu[0].iov_base);
@@ -139,6 +143,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 			return ret;
 		}
 
+		err("%s:%i", __func__, __LINE__);
+
 		l4len = tcp_fill_headers4(conn, NULL, iph, payload, optlen,
 					  NULL, seq, true);
 		l2len = sizeof(*iph);
@@ -165,6 +171,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 			return ret;
 		}
 
+		err("%s:%i", __func__, __LINE__);
+
 		l4len = tcp_fill_headers6(conn, NULL, ip6h, payload, optlen,
 					  seq, true);
 		l2len = sizeof(*ip6h);
@@ -196,6 +204,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 
 	vu_flush(vdev, vq, elem, nb_ack);
 
+	err("%s:%i", __func__, __LINE__);
+
 	return 0;
 }
 
---

and it looks like vu_collect_one_frame() returns 0, so
tcp_vu_send_flag() returns 0, we think we sent the FIN segment to the
guest, but we didn't:

---
21.9120: Flow 0 (TCP connection): event at tcp_sock_handler:2339
21.9120: tcp_vu_send_flag:102
21.9120: Flow 0 (TCP connection): event at tcp_vu_sock_recv:272
21.9120: Flow 0 (TCP connection): TAP_FIN_SENT: CLOSE_WAIT -> LAST_ACK
---

...if vu_collect_one_frame() returns >= 1, as it usually does,
tcp_vu_send_flag() would reach at least line 114.

Side note: I thought I commented on this in a previous revision but
I can't find my comment anymore: if vu_collect_one_frame() returns 0,
I guess that should be an error.

I don't know why vu_collect_one_frame() returns 0 here... I hope you
do. :)

-- 
@@ -99,6 +99,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	int nb_ack;
 	int ret;
 
+	err("%s:%i", __func__, __LINE__);
+
 	hdrlen = tcp_vu_l2_hdrlen(CONN_V6(conn));
 
 	vu_init_elem(elem, iov_vu, 2);
@@ -109,6 +111,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (elem_cnt < 1)
 		return 0;
 
+	err("%s:%i", __func__, __LINE__);
+
 	vu_set_vnethdr(vdev, &iov_vu[0], 1, 0);
 
 	eh = vu_eth(iov_vu[0].iov_base);
@@ -139,6 +143,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 			return ret;
 		}
 
+		err("%s:%i", __func__, __LINE__);
+
 		l4len = tcp_fill_headers4(conn, NULL, iph, payload, optlen,
 					  NULL, seq, true);
 		l2len = sizeof(*iph);
@@ -165,6 +171,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 			return ret;
 		}
 
+		err("%s:%i", __func__, __LINE__);
+
 		l4len = tcp_fill_headers6(conn, NULL, ip6h, payload, optlen,
 					  seq, true);
 		l2len = sizeof(*ip6h);
@@ -196,6 +204,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 
 	vu_flush(vdev, vq, elem, nb_ack);
 
+	err("%s:%i", __func__, __LINE__);
+
 	return 0;
 }
 
---

and it looks like vu_collect_one_frame() returns 0, so
tcp_vu_send_flag() returns 0, we think we sent the FIN segment to the
guest, but we didn't:

---
21.9120: Flow 0 (TCP connection): event at tcp_sock_handler:2339
21.9120: tcp_vu_send_flag:102
21.9120: Flow 0 (TCP connection): event at tcp_vu_sock_recv:272
21.9120: Flow 0 (TCP connection): TAP_FIN_SENT: CLOSE_WAIT -> LAST_ACK
---

...if vu_collect_one_frame() returns >= 1, as it usually does,
tcp_vu_send_flag() would reach at least line 114.

Side note: I thought I commented on this in a previous revision but
I can't find my comment anymore: if vu_collect_one_frame() returns 0,
I guess that should be an error.

I don't know why vu_collect_one_frame() returns 0 here... I hope you
do. :)

-- 
Stefano


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-22 18:19             ` Stefano Brivio
@ 2024-10-22 18:22               ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-22 18:22 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev

On Tue, 22 Oct 2024 20:19:14 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> I also took a capture (attached)

...not really, it's two megs:

  https://passt.top/static/socat.pcap

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-22 12:59         ` Laurent Vivier
  2024-10-22 13:19           ` Stefano Brivio
@ 2024-10-23 15:27           ` Laurent Vivier
  2024-10-23 16:23             ` Stefano Brivio
  1 sibling, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-10-23 15:27 UTC (permalink / raw)
  To: Stefano Brivio, David Gibson; +Cc: passt-dev

On 22/10/2024 14:59, Laurent Vivier wrote:
> On 17/10/2024 02:10, Stefano Brivio wrote:
>> On Wed, 16 Oct 2024 11:41:34 +1100
>> David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>>> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:
>>>> [Still partial review]
>>> [snip]
>>>>> +    if (peek_offset_cap)
>>>>> +        already_sent = 0;
>>>>> +
>>>>> +    iov_vu[0].iov_base = tcp_buf_discard;
>>>>> +    iov_vu[0].iov_len = already_sent;
>>>>
>>>> I think I had a similar comment to a previous revision. Now, I haven't
>>>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
>>>> I think this should eventually follow the same logic as the (updated)
>>>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
>>>> (!peek_offset_cap).
>>>>
>>>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
>>>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
>>>> zero iov_len if (peek_offset_cap).
>>>
>>>> I'll test that (unless you already did) -- if it works, we can fix this
>>>> up later as well.
>>>
>>> I believe I tested it at some point, and I think we're already using
>>> it somewhere.
>>
>> I tested it again just to be sure on a recent net.git kernel: sometimes
>> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
>> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
>> (using loopback address): big transfer" test instead.
>>
>> I can reproduce at least one of the two issues consistently (tests
>> stopped 5 times out of 5).
>>
>> The socat client completes the transfer, the server is still waiting
>> for something. I haven't taken captures yet or tried to re-send from
>> the client.
>>
>> It all works (consistently) with an older kernel without support for
>> SO_PEEK_OFF on TCP, but also on this kernel if I force peek_offset_cap
>> to false in tcp_init().
>>
> 
> I have a fix for that but there is an error I don't understand:
> when I run twice the test, the second time I have:
> 
> guest:
> # socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
> # socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
> 2024/10/22 08:51:58 socat[1485] E bind(5, {AF=2 0.0.0.0:10001}, 16): Address already in use
> 
> host:
> $ socat -u OPEN:test/big.bin TCP4:127.0.0.1:10001
> 
> If I wait a little it can work again several times and fails again.
> 
> Any idea?
> 
> The patch is:
> diff --git a/tcp_vu.c b/tcp_vu.c
> index 78884c673215..83e40fb07a03 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -379,6 +379,10 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn 
> *conn)
>                             conn->seq_ack_from_tap, conn->seq_to_tap);
>                  conn->seq_to_tap = conn->seq_ack_from_tap;
>                  already_sent = 0;
> +               if (tcp_set_peek_offset(conn->sock, 0)) {
> +                       tcp_rst(c, conn);
> +                       return -1;
> +               }
>          }
> 
>          if (!wnd_scaled || already_sent >= wnd_scaled) {
> @@ -389,14 +393,13 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn 
> *conn)
> 
>          /* Set up buffer descriptors we'll fill completely and partially. */
> 
> -       fillsize = wnd_scaled;
> +       fillsize = wnd_scaled - already_sent;
> 
>          if (peek_offset_cap)
>                  already_sent = 0;
> 
>          iov_vu[0].iov_base = tcp_buf_discard;
>          iov_vu[0].iov_len = already_sent;
> -       fillsize -= already_sent;
> 
>          /* collect the buffers from vhost-user and fill them with the
>           * data from the socket
> 
> 

For the moment, I can see a behavior change of recvmsg() with the new kernel.

without peek_offset_cap, if no new data is available, it returns "already_sent", so it 
enters in (found with tcp_vu.c but code samples from tcp_buf.c):

	==> recvmsg() returns already_sent, so len > 0

         ...
         sendlen -= already_sent; ==> here sendlen becomes 0

         if (sendlen <= 0) {
                 conn_flag(c, conn, STALLED);
                 return 0;
         }

With peek_offset, it returns -1, so it enters in:

         if (len < 0)
                 goto err;
...
err:
         if (errno != EAGAIN && errno != EWOULDBLOCK) {
                 ret = -errno;
                 tcp_rst(c, conn);
         }

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-23 15:27           ` Laurent Vivier
@ 2024-10-23 16:23             ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-10-23 16:23 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: David Gibson, passt-dev, Jon Maloy

On Wed, 23 Oct 2024 17:27:49 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> On 22/10/2024 14:59, Laurent Vivier wrote:
> > On 17/10/2024 02:10, Stefano Brivio wrote:  
> >> On Wed, 16 Oct 2024 11:41:34 +1100
> >> David Gibson <david@gibson.dropbear.id.au> wrote:
> >>  
> >>> On Tue, Oct 15, 2024 at 09:54:38PM +0200, Stefano Brivio wrote:  
> >>>> [Still partial review]  
> >>> [snip]  
> >>>>> +    if (peek_offset_cap)
> >>>>> +        already_sent = 0;
> >>>>> +
> >>>>> +    iov_vu[0].iov_base = tcp_buf_discard;
> >>>>> +    iov_vu[0].iov_len = already_sent;  
> >>>>
> >>>> I think I had a similar comment to a previous revision. Now, I haven't
> >>>> tested this (yet) on a kernel with support for SO_PEEK_OFF on TCP, but
> >>>> I think this should eventually follow the same logic as the (updated)
> >>>> tcp_buf_data_from_sock(): we should use tcp_buf_discard only if
> >>>> (!peek_offset_cap).
> >>>>
> >>>> It's fine to always initialise VIRTQUEUE_MAX_SIZE iov_vu items,
> >>>> starting from 1, for simplicity. But I'm not sure if it's safe to pass a
> >>>> zero iov_len if (peek_offset_cap).  
> >>>  
> >>>> I'll test that (unless you already did) -- if it works, we can fix this
> >>>> up later as well.  
> >>>
> >>> I believe I tested it at some point, and I think we're already using
> >>> it somewhere.  
> >>
> >> I tested it again just to be sure on a recent net.git kernel: sometimes
> >> the first test in passt_vu_in_ns/tcp, "TCP/IPv4: host to guest: big
> >> transfer" hangs on my setup, sometimes it's the "TCP/IPv4: ns to guest
> >> (using loopback address): big transfer" test instead.
> >>
> >> I can reproduce at least one of the two issues consistently (tests
> >> stopped 5 times out of 5).
> >>
> >> The socat client completes the transfer, the server is still waiting
> >> for something. I haven't taken captures yet or tried to re-send from
> >> the client.
> >>
> >> It all works (consistently) with an older kernel without support for
> >> SO_PEEK_OFF on TCP, but also on this kernel if I force peek_offset_cap
> >> to false in tcp_init().
> >>  
> > 
> > I have a fix for that but there is an error I don't understand:
> > when I run twice the test, the second time I have:
> > 
> > guest:
> > # socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
> > # socat -u TCP4-LISTEN:10001 OPEN:test_big.bin,create,trunc
> > 2024/10/22 08:51:58 socat[1485] E bind(5, {AF=2 0.0.0.0:10001}, 16): Address already in use
> > 
> > host:
> > $ socat -u OPEN:test/big.bin TCP4:127.0.0.1:10001
> > 
> > If I wait a little it can work again several times and fails again.
> > 
> > Any idea?
> > 
> > The patch is:
> > diff --git a/tcp_vu.c b/tcp_vu.c
> > index 78884c673215..83e40fb07a03 100644
> > --- a/tcp_vu.c
> > +++ b/tcp_vu.c
> > @@ -379,6 +379,10 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn 
> > *conn)
> >                             conn->seq_ack_from_tap, conn->seq_to_tap);
> >                  conn->seq_to_tap = conn->seq_ack_from_tap;
> >                  already_sent = 0;
> > +               if (tcp_set_peek_offset(conn->sock, 0)) {
> > +                       tcp_rst(c, conn);
> > +                       return -1;
> > +               }
> >          }
> > 
> >          if (!wnd_scaled || already_sent >= wnd_scaled) {
> > @@ -389,14 +393,13 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn 
> > *conn)
> > 
> >          /* Set up buffer descriptors we'll fill completely and partially. */
> > 
> > -       fillsize = wnd_scaled;
> > +       fillsize = wnd_scaled - already_sent;
> > 
> >          if (peek_offset_cap)
> >                  already_sent = 0;
> > 
> >          iov_vu[0].iov_base = tcp_buf_discard;
> >          iov_vu[0].iov_len = already_sent;
> > -       fillsize -= already_sent;
> > 
> >          /* collect the buffers from vhost-user and fill them with the
> >           * data from the socket
> > 
> >   
> 
> For the moment, I can see a behavior change of recvmsg() with the new kernel.
> 
> without peek_offset_cap, if no new data is available, it returns "already_sent", so it 
> enters in (found with tcp_vu.c but code samples from tcp_buf.c):
> 
> 	==> recvmsg() returns already_sent, so len > 0  
> 
>          ...
>          sendlen -= already_sent; ==> here sendlen becomes 0
> 
>          if (sendlen <= 0) {
>                  conn_flag(c, conn, STALLED);
>                  return 0;
>          }
> 
> With peek_offset, it returns -1, so it enters in:

This is expected, I think (and unfortunately not documented).

> 
>          if (len < 0)
>                  goto err;
> ...
> err:
>          if (errno != EAGAIN && errno != EWOULDBLOCK) {

But errno here should be EAGAIN, so yes, it looks buggy to me in the
sense that:

>                  ret = -errno;
>                  tcp_rst(c, conn);
>          }

we return 0 here without setting the STALLED flag. While it should
be fixed, that flag is some kind of optimisation, so this doesn't
really explain the issue that I mentioned in
20241022201914.072f7c7d@elisabeth:

  https://archives.passt.top/passt-dev/20241022201914.072f7c7d@elisabeth/

As a quick fix, you should probably do this in tcp_vu_data_from_sock():

	if (peek_offset_cap)	/* add this condition */
		len -= already_sent;

	if (len <= 0 || (peek_offset_cap && len == -1 && errno == EAGAIN))
		/* change this condition */
		...

...or you mean that due to this behaviour you don't call vu_queue_rewind()
and that causes troubles?

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17  0:10   ` Stefano Brivio
  2024-10-17  7:28     ` Laurent Vivier
@ 2024-11-14 10:20     ` Laurent Vivier
  2024-11-14 14:23       ` Stefano Brivio
  2024-11-14 10:23     ` Laurent Vivier
  2024-11-14 10:29     ` Laurent Vivier
  3 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-11-14 10:20 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 17/10/2024 02:10, Stefano Brivio wrote:
>> +		if (frame_size == 0)
>> +			first = &iov_vu[i + 1];
>> +
>> +		if (iov_vu[i + 1].iov_len > (size_t)len)
>> +			iov_vu[i + 1].iov_len = len;
>> +
>> +		len -= iov_vu[i + 1].iov_len;
>> +		iov_used++;
>> +
>> +		frame_size += iov_vu[i + 1].iov_len;
>> +		num_buffers++;
>> +
>> +		if (frame_size >= mss || len == 0 ||
>> +		    i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
>> +			if (i + 1 == iov_cnt)
>> +				check = NULL;
>> +
>> +			/* restore first iovec base: point to vnet header */
>> +			vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen);
>> +
>> +			tcp_vu_prepare(c, conn, first, frame_size, &check);
>> +			if (*c->pcap)  {
>> +				tcp_vu_update_check(tapside, first, num_buffers);
>> +				pcap_iov(first, num_buffers,
>> +					 sizeof(struct virtio_net_hdr_mrg_rxbuf));
>> +			}
>> +
>> +			conn->seq_to_tap += frame_size;
> We always increase this, even if, later...
> 
>> +
>> +			frame_size = 0;
>> +			num_buffers = 0;
>> +		}
>> +	}
>> +
>> +	/* release unused buffers */
>> +	vu_queue_rewind(vq, iov_cnt - iov_used);
>> +
>> +	/* send packets */
>> +	vu_flush(vdev, vq, elem, iov_used);
> we fail to send packets, that is, even if vu_queue_fill_by_index()
> returns early because (!vq->vring.avail).

vring.avail is a pointer to a structure. vring.avail is NULL if there is something wrong 
during the initialization. It's imported code, I think it's only a sanity check.
So in theory vu_flush() cannot fail.

> 
> We had this same issue on the non-vhost-user path until commit
> a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter for
> dropped frames") (completely reworked with time). There, it was pretty
> bad with small (default) values for wmem_max and rmem_max.
> 
> Now, I_guess_ with vhost-user it won't be so easy to hit that, because
> virtqueue buffers are (altogether) bigger, so we can probably fix this
> later, but if it's not exceedingly complicated, we should consider
> fixing it now. If we hit something like that, the behaviour is pretty
> bad, with constant retransmissions and stalls.
> 
> The mapping between queued frames and connections is done in
> tcp_data_to_tap(), where tcp4_frame_conns[] and tcp6_frame_conns[]
> items are set to the current (highest) value of tcp4_payload_used and
> tcp6_payload_used.
> 
> Then, if we fail to transmit some frames, tcp_revert_seq() uses those
> arrays to revert the seq_to_tap values.
> 
> I guess you could make vu_queue_fill_by_index() return an error,
> propagate it, and make vu_flush() call something like tcp_revert_seq()
> in case.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17  0:10   ` Stefano Brivio
  2024-10-17  7:28     ` Laurent Vivier
  2024-11-14 10:20     ` Laurent Vivier
@ 2024-11-14 10:23     ` Laurent Vivier
  2024-11-14 14:23       ` Stefano Brivio
  2024-11-14 10:29     ` Laurent Vivier
  3 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-11-14 10:23 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 17/10/2024 02:10, Stefano Brivio wrote:
>> +/**
>> + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
>> + *			     in window
>> + * @c:		Execution context
>> + * @conn:	Connection pointer
>> + *
>> + * Return: Negative on connection reset, 0 otherwise
>> + */
>> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>> +{
>> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>> +	struct vu_dev *vdev = c->vdev;
>> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
>> +	const struct flowside *tapside = TAPFLOW(conn);
>> +	uint16_t mss = MSS_GET(conn);
>> +	size_t l2_hdrlen, fillsize;
>> +	int i, iov_cnt, iov_used;
>> +	int v4 = CONN_V4(conn);
>> +	uint32_t already_sent = 0;
>> +	const uint16_t *check;
>> +	struct iovec *first;
>> +	int frame_size;
>> +	int num_buffers;
>> +	ssize_t len;
>> +
>> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
>> +		flow_err(conn,
>> +			 "Got packet, but RX virtqueue not usable yet");
>> +		return 0;
>> +	}
>> +
>> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
>> +
>> +	if (SEQ_LT(already_sent, 0)) {
>> +		/* RFC 761, section 2.1. */
>> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
>> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
>> +		conn->seq_to_tap = conn->seq_ack_from_tap;
>> +		already_sent = 0;
>> +	}
>> +
>> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
>> +		conn_flag(c, conn, STALLED);
>> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>> +		return 0;
>> +	}
>> +
>> +	/* Set up buffer descriptors we'll fill completely and partially. */
>> +
>> +	fillsize = wnd_scaled;
>> +
>> +	if (peek_offset_cap)
>> +		already_sent = 0;
>> +
>> +	iov_vu[0].iov_base = tcp_buf_discard;
>> +	iov_vu[0].iov_len = already_sent;
>> +	fillsize -= already_sent;
>> +
>> +	/* collect the buffers from vhost-user and fill them with the
>> +	 * data from the socket
>> +	 */
>> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
>> +	if (iov_cnt <= 0)
>> +		return iov_cnt;
>> +
>> +	len -= already_sent;
>> +	if (len <= 0) {
>> +		conn_flag(c, conn, STALLED);
>> +		vu_queue_rewind(vq, iov_cnt);
>> +		return 0;
>> +	}
>> +
>> +	conn_flag(c, conn, ~STALLED);
>> +
>> +	/* Likely, some new data was acked too. */
>> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
>> +
>> +	/* initialize headers */
>> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
>> +	iov_used = 0;
>> +	num_buffers = 0;
>> +	check = NULL;
>> +	frame_size = 0;
>> +
>> +	/* iov_vu is an array of buffers and the buffer size can be
>> +	 * smaller than the frame size we want to use but with
>> +	 * num_buffer we can merge several virtio iov buffers in one packet
>> +	 * we need only to set the packet headers in the first iov and
>> +	 * num_buffer to the number of iov entries
> ...this part is clear to me, what I don't understand is if we still
> have a way to guarantee that the sum of several buffers is big enough
> to fit frame_size bytes.

We don't have this garantee. But I think it's the same for the socket version?

Thanks,
Laurent



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-10-17  0:10   ` Stefano Brivio
                       ` (2 preceding siblings ...)
  2024-11-14 10:23     ` Laurent Vivier
@ 2024-11-14 10:29     ` Laurent Vivier
  2024-11-14 14:23       ` Stefano Brivio
  3 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-11-14 10:29 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 17/10/2024 02:10, Stefano Brivio wrote:
>> +/**
>> + * tcp_vu_prepare() - Prepare the packet header
>> + * @c:		Execution context
>> + * @conn:	Connection pointer
>> + * @first:	Pointer to the array of IO vectors
>> + * @dlen:	Packet data length
>> + * @check:	Checksum, if already known
>> + */
>> +static void tcp_vu_prepare(const struct ctx *c,
>> +			   struct tcp_tap_conn *conn, struct iovec *first,
>> +			   size_t dlen, const uint16_t **check)
>> +{
>> +	const struct flowside *toside = TAPFLOW(conn);
>> +	char *base = first->iov_base;
>> +	struct ethhdr *eh;
>> +
>> +	/* we guess the first iovec provided by the guest can embed
>> +	 * all the headers needed by L2 frame
>> +	 */
> What happens if it doesn't (buggy guest)? Do we have a way to make sure
> it's the case? I guess it's more straightforward to do this in
> tcp_vu_data_from_sock() where we check and set iov_len (even though the
> implication of VIRTIO_NET_F_MRG_RXBUF isn't totally clear to me).

According to spec, minimum size of a buffer is 1526 bytes
(https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2340003)

So if the guest is buggy, we will write in the guest memory out of the (buggy) provided 
buffer, and we can crash the guest. But it's what happens to a buggy guest.

We can't fix the guest, IMO passt should crash in this case (add an ASSERT()?).

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-14 10:20     ` Laurent Vivier
@ 2024-11-14 14:23       ` Stefano Brivio
  2024-11-14 15:16         ` Laurent Vivier
  0 siblings, 1 reply; 50+ messages in thread
From: Stefano Brivio @ 2024-11-14 14:23 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 14 Nov 2024 11:20:09 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/10/2024 02:10, Stefano Brivio wrote:
> >> +		if (frame_size == 0)
> >> +			first = &iov_vu[i + 1];
> >> +
> >> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> >> +			iov_vu[i + 1].iov_len = len;
> >> +
> >> +		len -= iov_vu[i + 1].iov_len;
> >> +		iov_used++;
> >> +
> >> +		frame_size += iov_vu[i + 1].iov_len;
> >> +		num_buffers++;
> >> +
> >> +		if (frame_size >= mss || len == 0 ||
> >> +		    i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> >> +			if (i + 1 == iov_cnt)
> >> +				check = NULL;
> >> +
> >> +			/* restore first iovec base: point to vnet header */
> >> +			vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen);
> >> +
> >> +			tcp_vu_prepare(c, conn, first, frame_size, &check);
> >> +			if (*c->pcap)  {
> >> +				tcp_vu_update_check(tapside, first, num_buffers);
> >> +				pcap_iov(first, num_buffers,
> >> +					 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> >> +			}
> >> +
> >> +			conn->seq_to_tap += frame_size;  
> > We always increase this, even if, later...
> >   
> >> +
> >> +			frame_size = 0;
> >> +			num_buffers = 0;
> >> +		}
> >> +	}
> >> +
> >> +	/* release unused buffers */
> >> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> >> +
> >> +	/* send packets */
> >> +	vu_flush(vdev, vq, elem, iov_used);  
> > we fail to send packets, that is, even if vu_queue_fill_by_index()
> > returns early because (!vq->vring.avail).  
> 
> vring.avail is a pointer to a structure. vring.avail is NULL if there is something wrong 
> during the initialization. It's imported code, I think it's only a sanity check.
> So in theory vu_flush() cannot fail.

Oh, I see now. I actually think it's preferable to crash in that
(theoretically impossible) case, without even an ASSERT() (we would
dereference a NULL pointer, eventually, even if not here).

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-14 10:23     ` Laurent Vivier
@ 2024-11-14 14:23       ` Stefano Brivio
  2024-11-15  8:30         ` Laurent Vivier
  0 siblings, 1 reply; 50+ messages in thread
From: Stefano Brivio @ 2024-11-14 14:23 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 14 Nov 2024 11:23:11 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/10/2024 02:10, Stefano Brivio wrote:
> >> +/**
> >> + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
> >> + *			     in window
> >> + * @c:		Execution context
> >> + * @conn:	Connection pointer
> >> + *
> >> + * Return: Negative on connection reset, 0 otherwise
> >> + */
> >> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> >> +{
> >> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> >> +	struct vu_dev *vdev = c->vdev;
> >> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> >> +	const struct flowside *tapside = TAPFLOW(conn);
> >> +	uint16_t mss = MSS_GET(conn);
> >> +	size_t l2_hdrlen, fillsize;
> >> +	int i, iov_cnt, iov_used;
> >> +	int v4 = CONN_V4(conn);
> >> +	uint32_t already_sent = 0;
> >> +	const uint16_t *check;
> >> +	struct iovec *first;
> >> +	int frame_size;
> >> +	int num_buffers;
> >> +	ssize_t len;
> >> +
> >> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> >> +		flow_err(conn,
> >> +			 "Got packet, but RX virtqueue not usable yet");
> >> +		return 0;
> >> +	}
> >> +
> >> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> >> +
> >> +	if (SEQ_LT(already_sent, 0)) {
> >> +		/* RFC 761, section 2.1. */
> >> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> >> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
> >> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> >> +		already_sent = 0;
> >> +	}
> >> +
> >> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> >> +		conn_flag(c, conn, STALLED);
> >> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> >> +		return 0;
> >> +	}
> >> +
> >> +	/* Set up buffer descriptors we'll fill completely and partially. */
> >> +
> >> +	fillsize = wnd_scaled;
> >> +
> >> +	if (peek_offset_cap)
> >> +		already_sent = 0;
> >> +
> >> +	iov_vu[0].iov_base = tcp_buf_discard;
> >> +	iov_vu[0].iov_len = already_sent;
> >> +	fillsize -= already_sent;
> >> +
> >> +	/* collect the buffers from vhost-user and fill them with the
> >> +	 * data from the socket
> >> +	 */
> >> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
> >> +	if (iov_cnt <= 0)
> >> +		return iov_cnt;
> >> +
> >> +	len -= already_sent;
> >> +	if (len <= 0) {
> >> +		conn_flag(c, conn, STALLED);
> >> +		vu_queue_rewind(vq, iov_cnt);
> >> +		return 0;
> >> +	}
> >> +
> >> +	conn_flag(c, conn, ~STALLED);
> >> +
> >> +	/* Likely, some new data was acked too. */
> >> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> >> +
> >> +	/* initialize headers */
> >> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> >> +	iov_used = 0;
> >> +	num_buffers = 0;
> >> +	check = NULL;
> >> +	frame_size = 0;
> >> +
> >> +	/* iov_vu is an array of buffers and the buffer size can be
> >> +	 * smaller than the frame size we want to use but with
> >> +	 * num_buffer we can merge several virtio iov buffers in one packet
> >> +	 * we need only to set the packet headers in the first iov and
> >> +	 * num_buffer to the number of iov entries  
> > ...this part is clear to me, what I don't understand is if we still
> > have a way to guarantee that the sum of several buffers is big enough
> > to fit frame_size bytes.  
> 
> We don't have this garantee. But I think it's the same for the socket version?

Well, there we do:

	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
	if (fill_bufs > TCP_FRAMES) {
		fill_bufs = TCP_FRAMES;

and we don't fetch more data than that from the socket (in one pass).

Is this implicit in the i < iov_cnt loop condition here? That's the part
I don't understand: how do we limit the amount of data we can dequeue
from a socket in one single pass.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-14 10:29     ` Laurent Vivier
@ 2024-11-14 14:23       ` Stefano Brivio
  2024-11-15 11:13         ` Laurent Vivier
  0 siblings, 1 reply; 50+ messages in thread
From: Stefano Brivio @ 2024-11-14 14:23 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 14 Nov 2024 11:29:36 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/10/2024 02:10, Stefano Brivio wrote:
> >> +/**
> >> + * tcp_vu_prepare() - Prepare the packet header
> >> + * @c:		Execution context
> >> + * @conn:	Connection pointer
> >> + * @first:	Pointer to the array of IO vectors
> >> + * @dlen:	Packet data length
> >> + * @check:	Checksum, if already known
> >> + */
> >> +static void tcp_vu_prepare(const struct ctx *c,
> >> +			   struct tcp_tap_conn *conn, struct iovec *first,
> >> +			   size_t dlen, const uint16_t **check)
> >> +{
> >> +	const struct flowside *toside = TAPFLOW(conn);
> >> +	char *base = first->iov_base;
> >> +	struct ethhdr *eh;
> >> +
> >> +	/* we guess the first iovec provided by the guest can embed
> >> +	 * all the headers needed by L2 frame
> >> +	 */  
> > What happens if it doesn't (buggy guest)? Do we have a way to make sure
> > it's the case? I guess it's more straightforward to do this in
> > tcp_vu_data_from_sock() where we check and set iov_len (even though the
> > implication of VIRTIO_NET_F_MRG_RXBUF isn't totally clear to me).  
> 
> According to spec, minimum size of a buffer is 1526 bytes
> (https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2340003)
> 
> So if the guest is buggy, we will write in the guest memory out of the (buggy) provided 
> buffer, and we can crash the guest. But it's what happens to a buggy guest.
> 
> We can't fix the guest, IMO passt should crash in this case (add an ASSERT()?).

I think we should rather call tap_sock_reset() (see tap_passt_input())
so that the guest has a chance to reconnect (you implemented this in
QEMU...).

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-14 14:23       ` Stefano Brivio
@ 2024-11-14 15:16         ` Laurent Vivier
  2024-11-14 15:38           ` Stefano Brivio
  0 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-11-14 15:16 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 14/11/2024 15:23, Stefano Brivio wrote:
> On Thu, 14 Nov 2024 11:20:09 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
>> On 17/10/2024 02:10, Stefano Brivio wrote:
>>>> +		if (frame_size == 0)
>>>> +			first = &iov_vu[i + 1];
>>>> +
>>>> +		if (iov_vu[i + 1].iov_len > (size_t)len)
>>>> +			iov_vu[i + 1].iov_len = len;
>>>> +
>>>> +		len -= iov_vu[i + 1].iov_len;
>>>> +		iov_used++;
>>>> +
>>>> +		frame_size += iov_vu[i + 1].iov_len;
>>>> +		num_buffers++;
>>>> +
>>>> +		if (frame_size >= mss || len == 0 ||
>>>> +		    i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
>>>> +			if (i + 1 == iov_cnt)
>>>> +				check = NULL;
>>>> +
>>>> +			/* restore first iovec base: point to vnet header */
>>>> +			vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen);
>>>> +
>>>> +			tcp_vu_prepare(c, conn, first, frame_size, &check);
>>>> +			if (*c->pcap)  {
>>>> +				tcp_vu_update_check(tapside, first, num_buffers);
>>>> +				pcap_iov(first, num_buffers,
>>>> +					 sizeof(struct virtio_net_hdr_mrg_rxbuf));
>>>> +			}
>>>> +
>>>> +			conn->seq_to_tap += frame_size;
>>> We always increase this, even if, later...
>>>    
>>>> +
>>>> +			frame_size = 0;
>>>> +			num_buffers = 0;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* release unused buffers */
>>>> +	vu_queue_rewind(vq, iov_cnt - iov_used);
>>>> +
>>>> +	/* send packets */
>>>> +	vu_flush(vdev, vq, elem, iov_used);
>>> we fail to send packets, that is, even if vu_queue_fill_by_index()
>>> returns early because (!vq->vring.avail).
>>
>> vring.avail is a pointer to a structure. vring.avail is NULL if there is something wrong
>> during the initialization. It's imported code, I think it's only a sanity check.
>> So in theory vu_flush() cannot fail.
> 
> Oh, I see now. I actually think it's preferable to crash in that
> (theoretically impossible) case, without even an ASSERT() (we would
> dereference a NULL pointer, eventually, even if not here).
> 

So what you propose is to remove the "if (!vq->vring.avail) return;"?

Thanks
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-14 15:16         ` Laurent Vivier
@ 2024-11-14 15:38           ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-11-14 15:38 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 14 Nov 2024 16:16:48 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 14/11/2024 15:23, Stefano Brivio wrote:
> > On Thu, 14 Nov 2024 11:20:09 +0100
> > Laurent Vivier <lvivier@redhat.com> wrote:
> >   
> >> On 17/10/2024 02:10, Stefano Brivio wrote:  
> >>>> +		if (frame_size == 0)
> >>>> +			first = &iov_vu[i + 1];
> >>>> +
> >>>> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> >>>> +			iov_vu[i + 1].iov_len = len;
> >>>> +
> >>>> +		len -= iov_vu[i + 1].iov_len;
> >>>> +		iov_used++;
> >>>> +
> >>>> +		frame_size += iov_vu[i + 1].iov_len;
> >>>> +		num_buffers++;
> >>>> +
> >>>> +		if (frame_size >= mss || len == 0 ||
> >>>> +		    i + 1 == iov_cnt || !vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> >>>> +			if (i + 1 == iov_cnt)
> >>>> +				check = NULL;
> >>>> +
> >>>> +			/* restore first iovec base: point to vnet header */
> >>>> +			vu_set_vnethdr(vdev, first, num_buffers, l2_hdrlen);
> >>>> +
> >>>> +			tcp_vu_prepare(c, conn, first, frame_size, &check);
> >>>> +			if (*c->pcap)  {
> >>>> +				tcp_vu_update_check(tapside, first, num_buffers);
> >>>> +				pcap_iov(first, num_buffers,
> >>>> +					 sizeof(struct virtio_net_hdr_mrg_rxbuf));
> >>>> +			}
> >>>> +
> >>>> +			conn->seq_to_tap += frame_size;  
> >>> We always increase this, even if, later...
> >>>      
> >>>> +
> >>>> +			frame_size = 0;
> >>>> +			num_buffers = 0;
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +	/* release unused buffers */
> >>>> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> >>>> +
> >>>> +	/* send packets */
> >>>> +	vu_flush(vdev, vq, elem, iov_used);  
> >>> we fail to send packets, that is, even if vu_queue_fill_by_index()
> >>> returns early because (!vq->vring.avail).  
> >>
> >> vring.avail is a pointer to a structure. vring.avail is NULL if there is something wrong
> >> during the initialization. It's imported code, I think it's only a sanity check.
> >> So in theory vu_flush() cannot fail.  
> > 
> > Oh, I see now. I actually think it's preferable to crash in that
> > (theoretically impossible) case, without even an ASSERT() (we would
> > dereference a NULL pointer, eventually, even if not here).
> 
> So what you propose is to remove the "if (!vq->vring.avail) return;"?

Yes, all of them, actually. It also avoids confusion I think.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-14 14:23       ` Stefano Brivio
@ 2024-11-15  8:30         ` Laurent Vivier
  2024-11-15 10:08           ` Stefano Brivio
  0 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-11-15  8:30 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 14/11/2024 15:23, Stefano Brivio wrote:
> On Thu, 14 Nov 2024 11:23:11 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
>> On 17/10/2024 02:10, Stefano Brivio wrote:
>>>> +/**
>>>> + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
>>>> + *			     in window
>>>> + * @c:		Execution context
>>>> + * @conn:	Connection pointer
>>>> + *
>>>> + * Return: Negative on connection reset, 0 otherwise
>>>> + */
>>>> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>>>> +{
>>>> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>>>> +	struct vu_dev *vdev = c->vdev;
>>>> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
>>>> +	const struct flowside *tapside = TAPFLOW(conn);
>>>> +	uint16_t mss = MSS_GET(conn);
>>>> +	size_t l2_hdrlen, fillsize;
>>>> +	int i, iov_cnt, iov_used;
>>>> +	int v4 = CONN_V4(conn);
>>>> +	uint32_t already_sent = 0;
>>>> +	const uint16_t *check;
>>>> +	struct iovec *first;
>>>> +	int frame_size;
>>>> +	int num_buffers;
>>>> +	ssize_t len;
>>>> +
>>>> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
>>>> +		flow_err(conn,
>>>> +			 "Got packet, but RX virtqueue not usable yet");
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
>>>> +
>>>> +	if (SEQ_LT(already_sent, 0)) {
>>>> +		/* RFC 761, section 2.1. */
>>>> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
>>>> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
>>>> +		conn->seq_to_tap = conn->seq_ack_from_tap;
>>>> +		already_sent = 0;
>>>> +	}
>>>> +
>>>> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
>>>> +		conn_flag(c, conn, STALLED);
>>>> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	/* Set up buffer descriptors we'll fill completely and partially. */
>>>> +
>>>> +	fillsize = wnd_scaled;
>>>> +
>>>> +	if (peek_offset_cap)
>>>> +		already_sent = 0;
>>>> +
>>>> +	iov_vu[0].iov_base = tcp_buf_discard;
>>>> +	iov_vu[0].iov_len = already_sent;
>>>> +	fillsize -= already_sent;
>>>> +
>>>> +	/* collect the buffers from vhost-user and fill them with the
>>>> +	 * data from the socket
>>>> +	 */
>>>> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
>>>> +	if (iov_cnt <= 0)
>>>> +		return iov_cnt;
>>>> +
>>>> +	len -= already_sent;
>>>> +	if (len <= 0) {
>>>> +		conn_flag(c, conn, STALLED);
>>>> +		vu_queue_rewind(vq, iov_cnt);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	conn_flag(c, conn, ~STALLED);
>>>> +
>>>> +	/* Likely, some new data was acked too. */
>>>> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
>>>> +
>>>> +	/* initialize headers */
>>>> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
>>>> +	iov_used = 0;
>>>> +	num_buffers = 0;
>>>> +	check = NULL;
>>>> +	frame_size = 0;
>>>> +
>>>> +	/* iov_vu is an array of buffers and the buffer size can be
>>>> +	 * smaller than the frame size we want to use but with
>>>> +	 * num_buffer we can merge several virtio iov buffers in one packet
>>>> +	 * we need only to set the packet headers in the first iov and
>>>> +	 * num_buffer to the number of iov entries
>>> ...this part is clear to me, what I don't understand is if we still
>>> have a way to guarantee that the sum of several buffers is big enough
>>> to fit frame_size bytes.
>>
>> We don't have this garantee. But I think it's the same for the socket version?
> 
> Well, there we do:
> 
> 	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
> 	if (fill_bufs > TCP_FRAMES) {
> 		fill_bufs = TCP_FRAMES;
> 
> and we don't fetch more data than that from the socket (in one pass).
> 
> Is this implicit in the i < iov_cnt loop condition here? That's the part
> I don't understand: how do we limit the amount of data we can dequeue
> from a socket in one single pass.
> 

In the loop "i < iov_cnt" is the number of available buffers collected previously. Usually 
the size of one buffer is 1536 bytes. We join the buffers here (when 
VIRTIO_NET_F_MRG_RXBUF is avaialble) to create frame with a size of "mss".

iov_cnt is computed in tcp_vu_sock_recv(): this is the number of buffers we have collected 
from the queue to have enough space to store fillsize bytes. But if we don't have enough 
buffers in the queue ioc_cnt will be lower and the size of the data we will collect will 
be truncated.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-15  8:30         ` Laurent Vivier
@ 2024-11-15 10:08           ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-11-15 10:08 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, 15 Nov 2024 09:30:45 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 14/11/2024 15:23, Stefano Brivio wrote:
> > On Thu, 14 Nov 2024 11:23:11 +0100
> > Laurent Vivier <lvivier@redhat.com> wrote:
> >   
> >> On 17/10/2024 02:10, Stefano Brivio wrote:  
> >>>> +/**
> >>>> + * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
> >>>> + *			     in window
> >>>> + * @c:		Execution context
> >>>> + * @conn:	Connection pointer
> >>>> + *
> >>>> + * Return: Negative on connection reset, 0 otherwise
> >>>> + */
> >>>> +int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> >>>> +{
> >>>> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> >>>> +	struct vu_dev *vdev = c->vdev;
> >>>> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> >>>> +	const struct flowside *tapside = TAPFLOW(conn);
> >>>> +	uint16_t mss = MSS_GET(conn);
> >>>> +	size_t l2_hdrlen, fillsize;
> >>>> +	int i, iov_cnt, iov_used;
> >>>> +	int v4 = CONN_V4(conn);
> >>>> +	uint32_t already_sent = 0;
> >>>> +	const uint16_t *check;
> >>>> +	struct iovec *first;
> >>>> +	int frame_size;
> >>>> +	int num_buffers;
> >>>> +	ssize_t len;
> >>>> +
> >>>> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> >>>> +		flow_err(conn,
> >>>> +			 "Got packet, but RX virtqueue not usable yet");
> >>>> +		return 0;
> >>>> +	}
> >>>> +
> >>>> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> >>>> +
> >>>> +	if (SEQ_LT(already_sent, 0)) {
> >>>> +		/* RFC 761, section 2.1. */
> >>>> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> >>>> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
> >>>> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> >>>> +		already_sent = 0;
> >>>> +	}
> >>>> +
> >>>> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> >>>> +		conn_flag(c, conn, STALLED);
> >>>> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> >>>> +		return 0;
> >>>> +	}
> >>>> +
> >>>> +	/* Set up buffer descriptors we'll fill completely and partially. */
> >>>> +
> >>>> +	fillsize = wnd_scaled;
> >>>> +
> >>>> +	if (peek_offset_cap)
> >>>> +		already_sent = 0;
> >>>> +
> >>>> +	iov_vu[0].iov_base = tcp_buf_discard;
> >>>> +	iov_vu[0].iov_len = already_sent;
> >>>> +	fillsize -= already_sent;
> >>>> +
> >>>> +	/* collect the buffers from vhost-user and fill them with the
> >>>> +	 * data from the socket
> >>>> +	 */
> >>>> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, &len);
> >>>> +	if (iov_cnt <= 0)
> >>>> +		return iov_cnt;
> >>>> +
> >>>> +	len -= already_sent;
> >>>> +	if (len <= 0) {
> >>>> +		conn_flag(c, conn, STALLED);
> >>>> +		vu_queue_rewind(vq, iov_cnt);
> >>>> +		return 0;
> >>>> +	}
> >>>> +
> >>>> +	conn_flag(c, conn, ~STALLED);
> >>>> +
> >>>> +	/* Likely, some new data was acked too. */
> >>>> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> >>>> +
> >>>> +	/* initialize headers */
> >>>> +	l2_hdrlen = tcp_vu_l2_hdrlen(!v4);
> >>>> +	iov_used = 0;
> >>>> +	num_buffers = 0;
> >>>> +	check = NULL;
> >>>> +	frame_size = 0;
> >>>> +
> >>>> +	/* iov_vu is an array of buffers and the buffer size can be
> >>>> +	 * smaller than the frame size we want to use but with
> >>>> +	 * num_buffer we can merge several virtio iov buffers in one packet
> >>>> +	 * we need only to set the packet headers in the first iov and
> >>>> +	 * num_buffer to the number of iov entries  
> >>> ...this part is clear to me, what I don't understand is if we still
> >>> have a way to guarantee that the sum of several buffers is big enough
> >>> to fit frame_size bytes.  
> >>
> >> We don't have this garantee. But I think it's the same for the socket version?  
> > 
> > Well, there we do:
> > 
> > 	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
> > 	if (fill_bufs > TCP_FRAMES) {
> > 		fill_bufs = TCP_FRAMES;
> > 
> > and we don't fetch more data than that from the socket (in one pass).
> > 
> > Is this implicit in the i < iov_cnt loop condition here? That's the part
> > I don't understand: how do we limit the amount of data we can dequeue
> > from a socket in one single pass.
> >   
> 
> In the loop "i < iov_cnt" is the number of available buffers collected previously. Usually 
> the size of one buffer is 1536 bytes. We join the buffers here (when 
> VIRTIO_NET_F_MRG_RXBUF is avaialble) to create frame with a size of "mss".
> 
> iov_cnt is computed in tcp_vu_sock_recv(): this is the number of buffers we have collected 
> from the queue to have enough space to store fillsize bytes. But if we don't have enough 
> buffers in the queue ioc_cnt will be lower and the size of the data we will collect will 
> be truncated.

Oh, I see, like I suspected.

I find it clearer to actually iterate on the "source" (data we have to
transfer) rather than on the destination (buffers we might fill), as we
do in tcp_data_from_sock(), but there it's easier as we own the buffers
and we know they all have the same size.

Never mind then, I guess it's clear enough for now, maybe at some point
we'll find a reasonable way to turn "i < iov_cnt" into a stop condition
and iterate in the for loop on a different variable.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-14 14:23       ` Stefano Brivio
@ 2024-11-15 11:13         ` Laurent Vivier
  2024-11-15 11:23           ` Stefano Brivio
  0 siblings, 1 reply; 50+ messages in thread
From: Laurent Vivier @ 2024-11-15 11:13 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 14/11/2024 15:23, Stefano Brivio wrote:
> On Thu, 14 Nov 2024 11:29:36 +0100
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
>> On 17/10/2024 02:10, Stefano Brivio wrote:
>>>> +/**
>>>> + * tcp_vu_prepare() - Prepare the packet header
>>>> + * @c:		Execution context
>>>> + * @conn:	Connection pointer
>>>> + * @first:	Pointer to the array of IO vectors
>>>> + * @dlen:	Packet data length
>>>> + * @check:	Checksum, if already known
>>>> + */
>>>> +static void tcp_vu_prepare(const struct ctx *c,
>>>> +			   struct tcp_tap_conn *conn, struct iovec *first,
>>>> +			   size_t dlen, const uint16_t **check)
>>>> +{
>>>> +	const struct flowside *toside = TAPFLOW(conn);
>>>> +	char *base = first->iov_base;
>>>> +	struct ethhdr *eh;
>>>> +
>>>> +	/* we guess the first iovec provided by the guest can embed
>>>> +	 * all the headers needed by L2 frame
>>>> +	 */
>>> What happens if it doesn't (buggy guest)? Do we have a way to make sure
>>> it's the case? I guess it's more straightforward to do this in
>>> tcp_vu_data_from_sock() where we check and set iov_len (even though the
>>> implication of VIRTIO_NET_F_MRG_RXBUF isn't totally clear to me).
>>
>> According to spec, minimum size of a buffer is 1526 bytes
>> (https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2340003)
>>
>> So if the guest is buggy, we will write in the guest memory out of the (buggy) provided
>> buffer, and we can crash the guest. But it's what happens to a buggy guest.
>>
>> We can't fix the guest, IMO passt should crash in this case (add an ASSERT()?).
> 
> I think we should rather call tap_sock_reset() (see tap_passt_input())
> so that the guest has a chance to reconnect (you implemented this in
> QEMU...).
> 

If I add a call to tap_sock_reset() in tcp_vu_data_from_sock(), I need to remove all the 
"const" from "struct ctx *" in the entire call chain. It's a lot. I'm not sure it is worth it.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v8 7/8] vhost-user: add vhost-user
  2024-11-15 11:13         ` Laurent Vivier
@ 2024-11-15 11:23           ` Stefano Brivio
  0 siblings, 0 replies; 50+ messages in thread
From: Stefano Brivio @ 2024-11-15 11:23 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, 15 Nov 2024 12:13:23 +0100
Laurent Vivier <lvivier@redhat.com> wrote:

> On 14/11/2024 15:23, Stefano Brivio wrote:
> > On Thu, 14 Nov 2024 11:29:36 +0100
> > Laurent Vivier <lvivier@redhat.com> wrote:
> >   
> >> On 17/10/2024 02:10, Stefano Brivio wrote:  
> >>>> +/**
> >>>> + * tcp_vu_prepare() - Prepare the packet header
> >>>> + * @c:		Execution context
> >>>> + * @conn:	Connection pointer
> >>>> + * @first:	Pointer to the array of IO vectors
> >>>> + * @dlen:	Packet data length
> >>>> + * @check:	Checksum, if already known
> >>>> + */
> >>>> +static void tcp_vu_prepare(const struct ctx *c,
> >>>> +			   struct tcp_tap_conn *conn, struct iovec *first,
> >>>> +			   size_t dlen, const uint16_t **check)
> >>>> +{
> >>>> +	const struct flowside *toside = TAPFLOW(conn);
> >>>> +	char *base = first->iov_base;
> >>>> +	struct ethhdr *eh;
> >>>> +
> >>>> +	/* we guess the first iovec provided by the guest can embed
> >>>> +	 * all the headers needed by L2 frame
> >>>> +	 */  
> >>> What happens if it doesn't (buggy guest)? Do we have a way to make sure
> >>> it's the case? I guess it's more straightforward to do this in
> >>> tcp_vu_data_from_sock() where we check and set iov_len (even though the
> >>> implication of VIRTIO_NET_F_MRG_RXBUF isn't totally clear to me).  
> >>
> >> According to spec, minimum size of a buffer is 1526 bytes
> >> (https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2340003)
> >>
> >> So if the guest is buggy, we will write in the guest memory out of the (buggy) provided
> >> buffer, and we can crash the guest. But it's what happens to a buggy guest.
> >>
> >> We can't fix the guest, IMO passt should crash in this case (add an ASSERT()?).  
> > 
> > I think we should rather call tap_sock_reset() (see tap_passt_input())
> > so that the guest has a chance to reconnect (you implemented this in
> > QEMU...).
> 
> If I add a call to tap_sock_reset() in tcp_vu_data_from_sock(), I need to remove all the 
> "const" from "struct ctx *" in the entire call chain. It's a lot. I'm not sure it is worth it.

...or you drop c->fd_tap = -1 from tap_sock_reset(). At a glance it
looks fine to me, we never check it afterwards, but please double
check. :)

If that's not feasible, yeah, I guess an ASSERT() will do for the
moment.

-- 
Stefano


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2024-11-15 11:23 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-10 12:28 [PATCH v8 0/8] Add vhost-user support to passt. (part 3) Laurent Vivier
2024-10-10 12:28 ` [PATCH v8 1/8] packet: replace struct desc by struct iovec Laurent Vivier
2024-10-10 12:28 ` [PATCH v8 2/8] vhost-user: introduce virtio API Laurent Vivier
2024-10-10 12:28 ` [PATCH v8 3/8] vhost-user: introduce vhost-user API Laurent Vivier
2024-10-10 12:28 ` [PATCH v8 4/8] udp: Prepare udp.c to be shared with vhost-user Laurent Vivier
2024-10-14  4:29   ` David Gibson
2024-10-10 12:28 ` [PATCH v8 5/8] tcp: Export headers functions Laurent Vivier
2024-10-14  4:29   ` David Gibson
2024-10-10 12:29 ` [PATCH v8 6/8] passt: rename tap_sock_init() to tap_backend_init() Laurent Vivier
2024-10-14  4:30   ` David Gibson
2024-10-14 22:38   ` Stefano Brivio
2024-10-10 12:29 ` [PATCH v8 7/8] vhost-user: add vhost-user Laurent Vivier
2024-10-15  3:23   ` David Gibson
2024-10-16 10:07     ` Laurent Vivier
2024-10-16 16:26       ` Stefano Brivio
2024-10-15 19:54   ` Stefano Brivio
2024-10-16  0:41     ` David Gibson
2024-10-17  0:10       ` Stefano Brivio
2024-10-17 11:25         ` Stefano Brivio
2024-10-17 11:54           ` Laurent Vivier
2024-10-17 17:18           ` Laurent Vivier
2024-10-17 17:25             ` Laurent Vivier
2024-10-17 17:33             ` Stefano Brivio
2024-10-17 21:21               ` Stefano Brivio
2024-10-22 12:59         ` Laurent Vivier
2024-10-22 13:19           ` Stefano Brivio
2024-10-22 18:19             ` Stefano Brivio
2024-10-22 18:22               ` Stefano Brivio
2024-10-23 15:27           ` Laurent Vivier
2024-10-23 16:23             ` Stefano Brivio
2024-10-17  0:10   ` Stefano Brivio
2024-10-17  7:28     ` Laurent Vivier
2024-10-17  8:33       ` Stefano Brivio
2024-11-14 10:20     ` Laurent Vivier
2024-11-14 14:23       ` Stefano Brivio
2024-11-14 15:16         ` Laurent Vivier
2024-11-14 15:38           ` Stefano Brivio
2024-11-14 10:23     ` Laurent Vivier
2024-11-14 14:23       ` Stefano Brivio
2024-11-15  8:30         ` Laurent Vivier
2024-11-15 10:08           ` Stefano Brivio
2024-11-14 10:29     ` Laurent Vivier
2024-11-14 14:23       ` Stefano Brivio
2024-11-15 11:13         ` Laurent Vivier
2024-11-15 11:23           ` Stefano Brivio
2024-10-10 12:29 ` [PATCH v8 8/8] test: Add tests for passt in vhost-user mode Laurent Vivier
2024-10-15  3:40   ` David Gibson
2024-10-15 19:54   ` Stefano Brivio
2024-10-16  8:06     ` Laurent Vivier
2024-10-16  9:47       ` Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).