public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [RFC v2 00/11] Add vhost-net kernel support
@ 2025-07-09 17:47 Eugenio Pérez
  2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

vhost-net is a kernel device that allows reading packets from a tap
device using virtio queues instead of regular read(2) and write(2).
This enables more efficient packet processing, as the kernel can write
the memory directly to userspace and back instead of wasting bandwidth
on copies, and it allows batching many packets in a single notification
(through eventfds) on both tx and rx.
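
At a glance, the setup sequence implemented in tap_ns_tun() looks
roughly like this (a simplified sketch; the exact ordering and error
handling are in patch 1):

	int vhost_fd = open("/dev/vhost-net", O_RDWR | O_NONBLOCK | O_CLOEXEC);

	ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
	ioctl(vhost_fd, VHOST_GET_FEATURES, &features);	/* mask, then set */
	ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &mem);	/* 1:1 HVA regions */
	ioctl(vhost_fd, VHOST_SET_FEATURES, &features);

	for (i = 0; i < 2; i++) {		/* queue 0: rx, queue 1: tx */
		ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);	/* eventfd */
		ioctl(vhost_fd, VHOST_SET_VRING_NUM, &num);	/* VHOST_NDESCS */
		ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
		ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);	/* eventfd */
		ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &tap);	/* tap fd */
	}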

Namespace tx performance improves from ~26.3Gbit/s to ~36.9Gbit/s.

Namespace rx performance improves from ~16Gbit/s to ~17.26Gbit/s.

RFC: At the moment only the following are supported:
* Receiving L2 packets from vhost-kernel into pasta
* Sending L4 TCP socket data through vhost-kernel to the namespace

TODO: Add vhost zerocopy in the tests, and compare with veth.
TODO: Implement at least UDP tx.  Or maybe we want UDP to be write(2) because
of latency?
TODO: Check style for variable declarations in for loops and use of curly
brackets when they wrap more than one line.
TODO: kerneldoc style function header comments

--
v2: Add TCP tx and integrate some comments from the previous series.  Please
    check each patch message for details.

Eugenio Pérez (11):
  tap: implement vhost_call_cb
  tap: add die() on vhost error
  Replace tx tap hdr with virtio_nethdr_mrg_rxbuf
  tcp: export memory regions to vhost
  virtio: Fill .next in tx queue
  tap: move static iov_sock to tcp_buf_data_from_sock
  tap: support tx through vhost
  tap: add tap_free_old_xmit
  tcp: start conversion to circular buffer
  Add poll(2) to used_idx
  tcp_buf: adding TCP tx circular buffer

 arp.c        |   2 +-
 epoll_type.h |   4 +
 passt.c      |  12 +-
 passt.h      |  11 +-
 tap.c        | 489 +++++++++++++++++++++++++++++++++++++++++++++++++--
 tap.h        |  13 +-
 tcp.c        |   2 +-
 tcp_buf.c    | 179 +++++++++++++++----
 tcp_buf.h    |  19 ++
 udp.c        |   2 +-
 10 files changed, 675 insertions(+), 58 deletions(-)

-- 
2.50.0


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC v2 01/11] tap: implement vhost_call_cb
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-23  6:56   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 02/11] tap: add die() on vhost error Eugenio Pérez
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

Implement the rx side of vhost-kernel, where the namespace sends
packets through its tap device and the kernel passes them to
pasta using a virtqueue.  The virtqueue is built on a virtual
ring (vring), which includes:
* A descriptor table (buffer info)
* An available ring (buffers ready for the kernel)
* A used ring (buffers that the kernel has filled)

The descriptor table holds an array of buffers defined by address and
length. The kernel writes the packets transmitted by the namespace into
these buffers.  The number of descriptors (vq size) is set by
VHOST_NDESCS.  Pasta fills this table using pkt_buf, splitting it
evenly across all descriptors.  This table is read-only for the kernel.

The available ring is where pasta marks which buffers the kernel can
use.  It is read-only for the kernel.  It includes a ring[] array with
the descriptor indexes and an avail->idx index: avail->idx is a
free-running counter, and its slot in ring[] is avail->idx modulo the
size of the virtqueue.  As pasta writes the rx buffers sequentially,
ring[] is always [0, 1, 2...] and only avail->idx is incremented when
new buffers are available for the kernel.  avail->idx can be
incremented by more than one at a time.

Pasta also notifies the kernel of new available buffers by writing to
the kick eventfd.
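
In code, the refill-and-kick path boils down to this (simplified from
rx_pkt_refill() and vhost_kick() below):

	/* Publish all reclaimed descriptors at once: avail->idx is
	 * free-running, the kernel takes it modulo VHOST_NDESCS.
	 */
	vring_avail_0.avail.idx += vqs[0].num_free;
	vqs[0].num_free = 0;

	smp_mb();	/* expose avail entries before reading used->flags */
	if (!(vring_used_0.used.flags & VRING_USED_F_NO_NOTIFY))
		eventfd_write(c->vq[0].kick_fd, 1);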

Once the kernel has written a frame into a descriptor, it writes that
descriptor's index into used_ring->ring[] and increments used_ring->idx
accordingly.  Like avail->idx, the kernel can increment it by more than
one at a time.  Pasta gets a notification on the call eventfd, so we
add it to the epoll context.

Pasta assumes buffers are used in order.  QEMU makes the same
assumption in its virtio-net migration code, so this is safe.
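
The consume side, as a sketch (simplified from virtqueue_get_rx_buf()
below; rx_used_pop() is just an illustrative name):

	static void *rx_used_pop(struct vring_used *used, unsigned *len)
	{
		uint16_t used_idx = le16toh(used->idx);
		uint16_t last_used;
		uint32_t i;

		smp_rmb();	/* read used->idx before the ring entries */
		if (used_idx == vqs[0].last_used_idx)
			return NULL;	/* nothing new from the kernel */

		last_used = vqs[0].last_used_idx++ % VHOST_NDESCS;
		i = le32toh(used->ring[last_used].id);	/* in order: == last_used */
		*len = le32toh(used->ring[last_used].len);
		vqs[0].num_free++;
		return pkt_buf + i * (PKT_BUF_BYTES / VHOST_NDESCS);
	}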

Now, vhost-kernel is designed to read the virtqueues and the buffers
through *guest* physical addresses (GPA), not process virtual addresses
(HVA).  QEMU communicates the translations through memory regions.
Since we don't have GPAs, just send memory regions that are a 1:1
translation of the HVAs.

TODO: Evaluate if we can reuse the tap fd code instead of making a new
epoll event type.
TODO: Split a new file for vhost (Stefano)

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
RFC v2:
* Need to integrate "now" parameter in tap_add_packet and replace
  TAP_MSGS with TAP_MSGS{4,6}.
* Actually refill rx queue at the end of operation.
* Explain virtqueues and memory regions theory of operation.
* Extract vhost_kick so it can be reused by tx.
* Add macro for VHOST regions.
* Only register call_cb in the rx queue, as tx calls are handled only
  if needed.
* Renamed vhost_call_cb to tap_vhost_input (David)
* Move the inuse and last_used_idx to a writable struct from vhost tx
  function. Context is readonly for vhost tx code path.
* Changed from inuse to num_free tracking, more aligned with kernel drv
* Always use the same tap_pool instead of having two split pools for
  v4 and v6.
* Space between (struct vhost_memory_region) {.XXX = ...}. (Stefano)
* Expand comments about memory region size (David)
* Add some TODOs.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 epoll_type.h |   2 +
 passt.c      |   7 +-
 passt.h      |  10 +-
 tap.c        | 311 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 tap.h        |   8 ++
 5 files changed, 335 insertions(+), 3 deletions(-)

diff --git a/epoll_type.h b/epoll_type.h
index 12ac59b..0371c14 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -44,6 +44,8 @@ enum epoll_type {
 	EPOLL_TYPE_REPAIR_LISTEN,
 	/* TCP_REPAIR helper socket */
 	EPOLL_TYPE_REPAIR,
+	/* vhost-kernel call socket */
+	EPOLL_TYPE_VHOST_CALL,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/passt.c b/passt.c
index 388d10f..0f2659c 100644
--- a/passt.c
+++ b/passt.c
@@ -79,6 +79,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
 	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
 	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
+	[EPOLL_TYPE_VHOST_CALL]		= "vhost-kernel call socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -310,7 +311,8 @@ loop:
 
 		switch (ref.type) {
 		case EPOLL_TYPE_TAP_PASTA:
-			tap_handler_pasta(&c, eventmask, &now);
+			// TODO: Find a better way, maybe reuse the fd.
+			// tap_handler_pasta(&c, eventmask, &now);
 			break;
 		case EPOLL_TYPE_TAP_PASST:
 			tap_handler_passt(&c, eventmask, &now);
@@ -357,6 +359,9 @@ loop:
 		case EPOLL_TYPE_REPAIR:
 			repair_handler(&c, eventmask);
 			break;
+		case EPOLL_TYPE_VHOST_CALL:
+			tap_vhost_input(&c, ref, &now);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index 8693794..7bb86c4 100644
--- a/passt.h
+++ b/passt.h
@@ -45,7 +45,7 @@ union epoll_ref;
  * @icmp:	ICMP-specific reference part
  * @data:	Data handled by protocol handlers
  * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
- * @queue:	vhost-user queue index for this fd
+ * @queue:	vhost queue index for this fd
  * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
  */
 union epoll_ref {
@@ -269,11 +269,14 @@ struct ctx {
 	int fd_tap;
 	int fd_repair_listen;
 	int fd_repair;
+	/* TODO document all added fields */
+	int fd_vhost;
 	unsigned char our_tap_mac[ETH_ALEN];
 	unsigned char guest_mac[ETH_ALEN];
 	uint16_t mtu;
 
 	uint64_t hash_secret[2];
+	uint64_t virtio_features;
 
 	int ifi4;
 	struct ip4_ctx ip4;
@@ -297,6 +300,11 @@ struct ctx {
 	int no_icmp;
 	struct icmp_ctx icmp;
 
+	struct {
+		int kick_fd;
+		int call_fd;
+	} vq[2];
+
 	int no_dns;
 	int no_dns_search;
 	int no_dhcp_dns;
diff --git a/tap.c b/tap.c
index 6db5d88..e4a3822 100644
--- a/tap.c
+++ b/tap.c
@@ -31,6 +31,7 @@
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
+#include <sys/eventfd.h>
 #include <sys/uio.h>
 #include <stdbool.h>
 #include <stdlib.h>
@@ -101,6 +102,51 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
 #define TAP_SEQS		128 /* Different L4 tuples in one batch */
 #define FRAGMENT_MSG_RATE	10  /* # seconds between fragment warnings */
 
+#define VHOST_VIRTIO         0xAF
+#define VHOST_GET_FEATURES   _IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_FEATURES   _IOW(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_OWNER	     _IO(VHOST_VIRTIO, 0x01)
+#define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
+#define VHOST_SET_VRING_NUM  _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+#define VHOST_SET_VRING_ERR  _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
+#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
+
+#define VHOST_NDESCS (PKT_BUF_BYTES / 65520)
+static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
+			 "Number of vhost descs must be a power of two by standard");
+static struct {
+	/* Number of free descriptors */
+	uint16_t num_free;
+
+	/* Last used idx processed */
+	uint16_t last_used_idx;
+} vqs[2];
+
+static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
+static union {
+	struct vring_avail avail;
+	char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])];
+} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __attribute__((aligned(PAGE_SIZE)));
+static union {
+	struct vring_used used;
+	char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
+} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));
+
+/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */
+#define N_VHOST_REGIONS 6
+union {
+	struct vhost_memory mem;
+	char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])];
+} vhost_memory = {
+	.mem = {
+		.nregions = N_VHOST_REGIONS,
+	},
+};
+
 /**
  * tap_l2_max_len() - Maximum frame size (including L2 header) for current mode
  * @c:		Execution context
@@ -399,6 +445,18 @@ void tap_icmp6_send(const struct ctx *c,
 	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
 }
 
+static void vhost_kick(struct vring_used *used, int kick_fd) {
+	/* We need to expose available array entries before checking avail
+	 * event.
+	 *
+	 * TODO: Does eventfd_write already do this?
+	 */
+	smp_mb();
+
+	if (!(used->flags & VRING_USED_F_NO_NOTIFY))
+		eventfd_write(kick_fd, 1);
+}
+
 /**
  * tap_send_frames_pasta() - Send multiple frames to the pasta tap
  * @c:			Execution context
@@ -1386,6 +1444,89 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 	tap_start_connection(c);
 }
 
+static void *virtqueue_get_rx_buf(unsigned qid, unsigned *len)
+{
+	struct vring_used *used = !qid ? &vring_used_0.used : &vring_used_1.used;
+	uint32_t i;
+	uint16_t used_idx, last_used;
+
+	/* TODO think if this has races with previous eventfd_read */
+	/* TODO we could improve performance with a shadow_used_idx */
+	used_idx = le16toh(used->idx);
+
+	smp_rmb();
+
+	if (used_idx == vqs[0].last_used_idx) {
+		*len = 0;
+		return NULL;
+	}
+
+	last_used = vqs[0].last_used_idx % VHOST_NDESCS;
+	i = le32toh(used->ring[last_used].id);
+	*len = le32toh(used->ring[last_used].len);
+
+	/* Make sure the kernel is consuming the descriptors in order */
+	if (i != last_used) {
+		die("vhost: id %u at used position %u != %u", i, last_used, last_used);
+		return NULL;
+	}
+
+	if (*len > PKT_BUF_BYTES/VHOST_NDESCS) {
+		warn("vhost: id %u len %u > %zu", i, *len, PKT_BUF_BYTES/VHOST_NDESCS);
+		return NULL;
+	}
+
+	/* TODO check if the id is valid and it has not been double used */
+	vqs[0].last_used_idx++;
+	vqs[0].num_free++;
+	return pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);
+}
+
+/* TODO this assumes the kernel consumes descriptors in order */
+static void rx_pkt_refill(struct ctx *c)
+{
+	/* TODO: tune this threshold */
+	if (!vqs[0].num_free)
+		return;
+
+	vring_avail_0.avail.idx += vqs[0].num_free;
+	vqs[0].num_free = 0;
+	vhost_kick(&vring_used_0.used, c->vq[0].kick_fd);
+}
+
+void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct timespec *now)
+{
+	eventfd_read(ref.fd, (eventfd_t[]){ 0 });
+
+	tap_flush_pools();
+
+	while (true) {
+		struct virtio_net_hdr_mrg_rxbuf *hdr;
+		unsigned len;
+
+		hdr = virtqueue_get_rx_buf(ref.queue, &len);
+		if (!hdr)
+			break;
+
+		if (len < sizeof(*hdr)) {
+			warn("vhost: invalid len %u", len);
+			continue;
+		}
+
+		/* TODO this will break from this moment */
+		if (hdr->num_buffers != 1) {
+			warn("vhost: Too many buffers %u, %zu bytes should be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS);
+			continue;
+		}
+
+		/* TODO fix the v6 pool to support ipv6 */
+		tap_add_packet(c, len - sizeof(*hdr), (void *)(hdr+1), now);
+	}
+
+	tap_handler(c, now);
+	rx_pkt_refill(c);
+}
+
 /**
  * tap_ns_tun() - Get tuntap fd in namespace
  * @c:		Execution context
@@ -1396,10 +1537,14 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
  */
 static int tap_ns_tun(void *arg)
 {
+	/* TODO we need to check if vhost support VIRTIO_NET_F_MRG_RXBUF and VHOST_NET_F_VIRTIO_NET_HDR actually */
+	static const uint64_t features =
+		(1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_NET_F_MRG_RXBUF) | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR);
 	struct ifreq ifr = { .ifr_flags = IFF_TAP | IFF_NO_PI };
 	int flags = O_RDWR | O_NONBLOCK | O_CLOEXEC;
 	struct ctx *c = (struct ctx *)arg;
-	int fd, rc;
+	unsigned i;
+	int fd, vhost_fd, rc;
 
 	c->fd_tap = -1;
 	memcpy(ifr.ifr_name, c->pasta_ifn, IFNAMSIZ);
@@ -1409,6 +1554,143 @@ static int tap_ns_tun(void *arg)
 	if (fd < 0)
 		die_perror("Failed to open() /dev/net/tun");
 
+	vhost_fd = open("/dev/vhost-net", flags);
+	if (vhost_fd < 0)
+		die_perror("Failed to open() /dev/vhost-net");
+
+	rc = ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
+	if (rc < 0)
+		die_perror("VHOST_SET_OWNER ioctl on /dev/vhost-net failed");
+
+	rc = ioctl(vhost_fd, VHOST_GET_FEATURES, &c->virtio_features);
+	if (rc < 0)
+		die_perror("VHOST_GET_FEATURES ioctl on /dev/vhost-net failed");
+
+	/* TODO inform more explicitly */
+	fprintf(stderr, "vhost features: %lx\n", c->virtio_features);
+	fprintf(stderr, "req features: %lx\n", features);
+	c->virtio_features &= features;
+	if (c->virtio_features != features)
+		die("vhost does not support required features");
+
+	for (i = 0; i < ARRAY_SIZE(c->vq); i++) {
+		struct vhost_vring_file file = {
+			.index = i,
+		};
+		union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_CALL,
+					.queue = i };
+		struct epoll_event ev;
+
+		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+		ref.fd = file.fd;
+		if (file.fd < 0)
+			die_perror("Failed to create call eventfd");
+
+		rc = ioctl(vhost_fd, VHOST_SET_VRING_CALL, &file);
+		if (rc < 0)
+			die_perror(
+				"VHOST_SET_VRING_CALL ioctl on /dev/vhost-net failed");
+
+		ev = (struct epoll_event){ .data.u64 = ref.u64, .events = EPOLLIN };
+		if (i == 0) {
+			rc = epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+			if (rc < 0)
+				die_perror("Failed to add call eventfd to epoll");
+		}
+		c->vq[i].call_fd = file.fd;
+	}
+
+	/*
+	 * Unlike most C-style APIs, userspace_addr+memory_size is also
+	 * accessible by the kernel.  Include a -1 to adjust.
+	 */
+#define VHOST_MEMORY_REGION_PTR(addr, size) \
+	(struct vhost_memory_region) { \
+		.guest_phys_addr = (uintptr_t)addr, \
+		.memory_size = size - 1, \
+		.userspace_addr = (uintptr_t)addr, \
+	}
+#define VHOST_MEMORY_REGION(elem) VHOST_MEMORY_REGION_PTR(&elem, sizeof(elem))
+
+	/* 1:1 translation */
+	vhost_memory.mem.regions[0] = VHOST_MEMORY_REGION(vring_desc);
+	vhost_memory.mem.regions[1] = VHOST_MEMORY_REGION(vring_avail_0);
+	vhost_memory.mem.regions[2] = VHOST_MEMORY_REGION(vring_avail_1);
+	vhost_memory.mem.regions[3] = VHOST_MEMORY_REGION(vring_used_0);
+	vhost_memory.mem.regions[4] = VHOST_MEMORY_REGION(vring_used_1);
+	vhost_memory.mem.regions[5] = VHOST_MEMORY_REGION(pkt_buf);
+	static_assert(5 < N_VHOST_REGIONS);
+#undef VHOST_MEMORY_REGION
+#undef VHOST_MEMORY_REGION_PTR
+
+	rc = ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &vhost_memory.mem);
+	if (rc < 0)
+		die_perror(
+			"VHOST_SET_MEM_TABLE ioctl on /dev/vhost-net failed");
+
+	/* TODO: probably it increases RX perf */
+#if 0
+	struct ifreq ifr;
+	memset(&ifr, 0, sizeof(ifr));
+
+	if (ioctl(fd, TUNGETIFF, &ifr) != 0)
+		die_perror("Unable to query TUNGETIFF on FD %d", fd);
+	}
+
+	if (ifr.ifr_flags & IFF_VNET_HDR)
+		net->dev.features &= ~(1ULL << VIRTIO_NET_F_MRG_RXBUF);
+#endif
+	rc = ioctl(vhost_fd, VHOST_SET_FEATURES, &c->virtio_features);
+	if (rc < 0)
+		die_perror("VHOST_SET_FEATURES ioctl on /dev/vhost-net failed");
+
+	/* Duplicating foreach queue to follow the exact order from QEMU */
+	for (i = 0; i < ARRAY_SIZE(c->vq); i++) {
+		struct vhost_vring_addr addr = {
+			.index = i,
+			.desc_user_addr = (unsigned long)vring_desc[i],
+			.avail_user_addr = i == 0 ? (unsigned long)&vring_avail_0 :
+										(unsigned long)&vring_avail_1,
+			.used_user_addr = i == 0 ? (unsigned long)&vring_used_0 :
+										(unsigned long)&vring_used_1,
+			/* GPA addr */
+			.log_guest_addr = i == 0 ? (unsigned long)&vring_used_0 :
+									   (unsigned long)&vring_used_1,
+		};
+		struct vhost_vring_state state = {
+			.index = i,
+			.num = VHOST_NDESCS,
+		};
+		struct vhost_vring_file file = {
+			.index = i,
+		};
+
+		rc = ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state);
+		if (rc < 0)
+			die_perror(
+				"VHOST_SET_VRING_NUM ioctl on /dev/vhost-net failed");
+
+		fprintf(stderr, "qid: %d\n", i);
+		fprintf(stderr, "vhost desc addr: 0x%llx\n", addr.desc_user_addr);
+		fprintf(stderr, "vhost avail addr: 0x%llx\n", addr.avail_user_addr);
+		fprintf(stderr, "vhost used addr: 0x%llx\n", addr.used_user_addr);
+		rc = ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
+		if (rc < 0)
+			die_perror(
+				"VHOST_SET_VRING_ADDR ioctl on /dev/vhost-net failed");
+
+		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+		if (file.fd < 0)
+			die_perror("Failed to create kick eventfd");
+		rc = ioctl(vhost_fd, VHOST_SET_VRING_KICK, &file);
+		if (rc < 0)
+			die_perror(
+				"VHOST_SET_VRING_KICK ioctl on /dev/vhost-net failed");
+		c->vq[i].kick_fd = file.fd;
+
+		vqs[i].num_free = VHOST_NDESCS;
+	}
+
 	rc = ioctl(fd, (int)TUNSETIFF, &ifr);
 	if (rc < 0)
 		die_perror("TUNSETIFF ioctl on /dev/net/tun failed");
@@ -1416,7 +1698,34 @@ static int tap_ns_tun(void *arg)
 	if (!(c->pasta_ifi = if_nametoindex(c->pasta_ifn)))
 		die("Tap device opened but no network interface found");
 
+	for (i = 0; i < VHOST_NDESCS; ++i) {
+		vring_desc[0][i].addr = (uintptr_t)pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);
+		vring_desc[0][i].len = PKT_BUF_BYTES/VHOST_NDESCS;
+		vring_desc[0][i].flags = VRING_DESC_F_WRITE;
+	}
+	for (i = 0; i < VHOST_NDESCS; ++i) {
+		vring_avail_0.avail.ring[i] = i;
+	}
+
+	rx_pkt_refill(c);
+
+	/* Duplicating foreach queue to follow the exact order from QEMU */
+	for (int i = 0; i < ARRAY_SIZE(c->vq); i++) {
+		struct vhost_vring_file file = {
+			.index = i,
+			.fd = fd,
+		};
+
+		fprintf(stderr, "qid: %d\n", file.index);
+		fprintf(stderr, "tap fd: %d\n", file.fd);
+		rc = ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &file);
+		if (rc < 0)
+			die_perror(
+				"VHOST_NET_SET_BACKEND ioctl on /dev/vhost-net failed");
+	}
+
 	c->fd_tap = fd;
+	c->fd_vhost = vhost_fd;
 
 	return 0;
 }
diff --git a/tap.h b/tap.h
index 6fe3d15..ff8cee5 100644
--- a/tap.h
+++ b/tap.h
@@ -69,6 +69,14 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 		thdr->vnet_len = htonl(l2len);
 }
 
+/**
+ * tap_vhost_input() - Handler for new data on the socket to hypervisor vq
+ * @c:		Execution context
+ * @ref:	epoll reference
+ * @now:	Current timestamp
+ */
+void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct timespec *now);
+
 unsigned long tap_l2_max_len(const struct ctx *c);
 void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
 void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 02/11] tap: add die() on vhost error
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
  2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-23  6:58   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf Eugenio Pérez
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

In case the kernel needs to signal an error.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
RFC v2: Add TODO (David)

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 epoll_type.h |  2 ++
 passt.c      |  5 +++++
 passt.h      |  1 +
 tap.c        | 16 ++++++++++++++++
 4 files changed, 24 insertions(+)

diff --git a/epoll_type.h b/epoll_type.h
index 0371c14..93c4701 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -46,6 +46,8 @@ enum epoll_type {
 	EPOLL_TYPE_REPAIR,
 	/* vhost-kernel call socket */
 	EPOLL_TYPE_VHOST_CALL,
+	/* vhost-kernel error socket */
+	EPOLL_TYPE_VHOST_ERROR,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/passt.c b/passt.c
index 0f2659c..d839f5a 100644
--- a/passt.c
+++ b/passt.c
@@ -80,6 +80,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
 	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
 	[EPOLL_TYPE_VHOST_CALL]		= "vhost-kernel call socket",
+	[EPOLL_TYPE_VHOST_ERROR]	= "vhost-kernel error socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -362,6 +363,10 @@ loop:
 		case EPOLL_TYPE_VHOST_CALL:
 			tap_vhost_input(&c, ref, &now);
 			break;
+		case EPOLL_TYPE_VHOST_ERROR:
+			/* TODO re-initialize vhost */
+			die("Error on vhost-kernel socket");
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index 7bb86c4..0066145 100644
--- a/passt.h
+++ b/passt.h
@@ -303,6 +303,7 @@ struct ctx {
 	struct {
 		int kick_fd;
 		int call_fd;
+		int err_fd;
 	} vq[2];
 
 	int no_dns;
diff --git a/tap.c b/tap.c
index e4a3822..0c49e6d 100644
--- a/tap.c
+++ b/tap.c
@@ -1598,6 +1598,22 @@ static int tap_ns_tun(void *arg)
 				die_perror("Failed to add call eventfd to epoll");
 		}
 		c->vq[i].call_fd = file.fd;
+
+		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+		if (file.fd < 0)
+			die_perror("Failed to create error eventfd");
+
+		rc = ioctl(vhost_fd, VHOST_SET_VRING_ERR, &file);
+		if (rc < 0)
+			die_perror(
+				"VHOST_SET_VRING_ERR ioctl on /dev/vhost-net failed");
+
+		ref.type = EPOLL_TYPE_VHOST_ERROR, ref.fd = file.fd;
+		ev.data.u64 = ref.u64;
+		rc = epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+		if (rc < 0)
+			die_perror("Failed to add error eventfd to epoll");
+		c->vq[i].err_fd = file.fd;
 	}
 
 	/*
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
  2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
  2025-07-09 17:47 ` [RFC v2 02/11] tap: add die() on vhost error Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-24  0:17   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 04/11] tcp: export memory regions to vhost Eugenio Pérez
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

vhost-kernel expects this header as the first data of the frame.
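
For reference, the header layout from <linux/virtio_net.h>:

	struct virtio_net_hdr_mrg_rxbuf {
		struct virtio_net_hdr hdr;	/* flags, gso_type, hdr_len,
						 * gso_size, csum_start,
						 * csum_offset */
		__virtio16 num_buffers;		/* only meaningful on rx */
	};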

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 tap.c     |  2 +-
 tcp_buf.c | 15 +++++++++++++--
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/tap.c b/tap.c
index 0c49e6d..0656294 100644
--- a/tap.c
+++ b/tap.c
@@ -593,7 +593,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 		      nframes - m, nframes);

 	pcap_multiple(iov, bufs_per_frame, m,
-		      c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
+		      c->mode == MODE_PASST ? sizeof(uint32_t) : sizeof(struct virtio_net_hdr_mrg_rxbuf));

 	return m;
 }
diff --git a/tcp_buf.c b/tcp_buf.c
index d1fca67..2fbd056 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -22,6 +22,8 @@

 #include <netinet/tcp.h>

+#include <linux/virtio_net.h>
+
 #include "util.h"
 #include "ip.h"
 #include "iov.h"
@@ -43,7 +45,7 @@
 static struct ethhdr		tcp4_eth_src;
 static struct ethhdr		tcp6_eth_src;

-static struct tap_hdr		tcp_payload_tap_hdr[TCP_FRAMES_MEM];
+static struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];

 /* IP headers for IPv4 and IPv6 */
 struct iphdr		tcp4_payload_ip[TCP_FRAMES_MEM];
@@ -75,6 +77,14 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
 	eth_update_mac(&tcp6_eth_src, eth_d, eth_s);
 }

+static inline struct iovec virtio_net_hdr_iov(struct virtio_net_hdr_mrg_rxbuf *hdr)
+{
+        return (struct iovec){
+                .iov_base = hdr,
+                .iov_len = sizeof(*hdr),
+        };
+}
+
 /**
  * tcp_sock_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
  * @c:		Execution context
@@ -85,6 +95,7 @@ void tcp_sock_iov_init(const struct ctx *c)
 	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
 	int i;

+	(void)c;
 	tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
 	tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);

@@ -96,7 +107,7 @@ void tcp_sock_iov_init(const struct ctx *c)
 	for (i = 0; i < TCP_FRAMES_MEM; i++) {
 		struct iovec *iov = tcp_l2_iov[i];

-		iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp_payload_tap_hdr[i]);
+		iov[TCP_IOV_TAP] = virtio_net_hdr_iov(&tcp_payload_tap_hdr[i]);
 		iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
 		iov[TCP_IOV_PAYLOAD].iov_base = &tcp_payload[i];
 	}
--
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 04/11] tcp: export memory regions to vhost
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (2 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-23  7:06   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 05/11] virtio: Fill .next in tx queue Eugenio Pérez
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

So that vhost-kernel is able to access the TCP buffers.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 tap.c     | 14 +++++++++++---
 tcp_buf.c | 14 ++++----------
 tcp_buf.h | 19 +++++++++++++++++++
 3 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/tap.c b/tap.c
index 0656294..8b3ec45 100644
--- a/tap.c
+++ b/tap.c
@@ -63,6 +63,8 @@
 #include "vhost_user.h"
 #include "vu_common.h"

+#include "tcp_buf.h"
+
 /* Maximum allowed frame lengths (including L2 header) */

 /* Verify that an L2 frame length limit is large enough to contain the header,
@@ -136,8 +138,8 @@ static union {
 	char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
 } vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));

-/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */
-#define N_VHOST_REGIONS 6
+/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf + TCP virtio hdr + TCP eth(src,dst) + TCP ip hdr */
+#define N_VHOST_REGIONS 12
 union {
 	struct vhost_memory mem;
 	char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])];
@@ -1635,7 +1637,13 @@ static int tap_ns_tun(void *arg)
 	vhost_memory.mem.regions[3] = VHOST_MEMORY_REGION(vring_used_0);
 	vhost_memory.mem.regions[4] = VHOST_MEMORY_REGION(vring_used_1);
 	vhost_memory.mem.regions[5] = VHOST_MEMORY_REGION(pkt_buf);
-	static_assert(5 < N_VHOST_REGIONS);
+	vhost_memory.mem.regions[6] = VHOST_MEMORY_REGION(tcp_payload_tap_hdr);
+	vhost_memory.mem.regions[7] = VHOST_MEMORY_REGION(tcp4_eth_src);
+	vhost_memory.mem.regions[8] = VHOST_MEMORY_REGION(tcp6_eth_src);
+	vhost_memory.mem.regions[9] = VHOST_MEMORY_REGION(tcp4_payload_ip);
+	vhost_memory.mem.regions[10] = VHOST_MEMORY_REGION(tcp6_payload_ip);
+	vhost_memory.mem.regions[11] = VHOST_MEMORY_REGION(tcp_payload);
+	static_assert(11 < N_VHOST_REGIONS);
 #undef VHOST_MEMORY_REGION
 #undef VHOST_MEMORY_REGION_PTR

diff --git a/tcp_buf.c b/tcp_buf.c
index 2fbd056..c999d2e 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -22,8 +22,6 @@

 #include <netinet/tcp.h>

-#include <linux/virtio_net.h>
-
 #include "util.h"
 #include "ip.h"
 #include "iov.h"
@@ -35,24 +33,20 @@
 #include "tcp_internal.h"
 #include "tcp_buf.h"

-#define TCP_FRAMES_MEM			128
-#define TCP_FRAMES							   \
-	(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
-
 /* Static buffers */

 /* Ethernet header for IPv4 and IPv6 frames */
-static struct ethhdr		tcp4_eth_src;
-static struct ethhdr		tcp6_eth_src;
+struct ethhdr		tcp4_eth_src;
+struct ethhdr		tcp6_eth_src;

-static struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
+struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];

 /* IP headers for IPv4 and IPv6 */
 struct iphdr		tcp4_payload_ip[TCP_FRAMES_MEM];
 struct ipv6hdr		tcp6_payload_ip[TCP_FRAMES_MEM];

 /* TCP segments with payload for IPv4 and IPv6 frames */
-static struct tcp_payload_t	tcp_payload[TCP_FRAMES_MEM];
+struct tcp_payload_t	tcp_payload[TCP_FRAMES_MEM];

 static_assert(MSS4 <= sizeof(tcp_payload[0].data), "MSS4 is greater than 65516");
 static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516");
diff --git a/tcp_buf.h b/tcp_buf.h
index 54f5e53..7ae2536 100644
--- a/tcp_buf.h
+++ b/tcp_buf.h
@@ -6,9 +6,28 @@
 #ifndef TCP_BUF_H
 #define TCP_BUF_H

+#include <linux/virtio_net.h>
+
+#include "tcp_conn.h"
+#include "tcp_internal.h"
+
 void tcp_sock_iov_init(const struct ctx *c);
 void tcp_payload_flush(const struct ctx *c);
+struct tcp_tap_conn;
 int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
 int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);

+#define TCP_FRAMES_MEM			128
+#define TCP_FRAMES							   \
+(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
+
+extern struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
+extern struct tcp_payload_t	tcp_payload[TCP_FRAMES_MEM];
+
+extern struct ethhdr		tcp4_eth_src;
+extern struct ethhdr		tcp6_eth_src;
+
+extern struct iphdr		tcp4_payload_ip[TCP_FRAMES_MEM];
+extern struct ipv6hdr		tcp6_payload_ip[TCP_FRAMES_MEM];
+
 #endif  /*TCP_BUF_H */
--
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 05/11] virtio: Fill .next in tx queue
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (3 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 04/11] tcp: export memory regions to vhost Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-23  7:07   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock Eugenio Pérez
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

This way we can send one frame split across many buffers, as the
sketch below illustrates.  The TCP stack already works this way in
pasta.
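
Illustrative sketch of how the kernel walks a chain, per the virtio
split-ring layout (consume() and head are placeholders):

	for (i = head; ; i = le16toh(vring_desc[1][i].next)) {
		/* consume() stands in for the device using the buffer */
		consume(vring_desc[1][i].addr, vring_desc[1][i].len);
		if (!(le16toh(vring_desc[1][i].flags) & VRING_DESC_F_NEXT))
			break;
	}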

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 tap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tap.c b/tap.c
index 8b3ec45..5667fbe 100644
--- a/tap.c
+++ b/tap.c
@@ -1727,6 +1727,9 @@ static int tap_ns_tun(void *arg)
 		vring_desc[0][i].len = PKT_BUF_BYTES/VHOST_NDESCS;
 		vring_desc[0][i].flags = VRING_DESC_F_WRITE;
 	}
+	for (i = 0; i < (VHOST_NDESCS - 1); ++i) {
+		vring_desc[1][i].next = i+1;
+	}
 	for (i = 0; i < VHOST_NDESCS; ++i) {
 		vring_avail_0.avail.ring[i] = i;
 	}
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (4 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 05/11] virtio: Fill .next in tx queue Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-23  7:09   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 07/11] tap: support tx through vhost Eugenio Pérez
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

As it is the only function using it.  I keep confusing it with
tcp_l2_iov; moving it here avoids that.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 tcp_buf.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/tcp_buf.c b/tcp_buf.c
index c999d2e..6d79d67 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -55,9 +55,6 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
 static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
 static unsigned int tcp_payload_used;
 
-/* recvmsg()/sendmsg() data for tap */
-static struct iovec	iov_sock		[TCP_FRAMES_MEM + 1];
-
 static struct iovec	tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
 
 /**
@@ -292,6 +289,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 {
+	static struct iovec iov_sock[TCP_FRAMES_MEM + 1];
+
 	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
 	int len, dlen, i, s = conn->sock;
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 07/11] tap: support tx through vhost
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (5 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-24  0:24   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 08/11] tap: add tap_free_old_xmit Eugenio Pérez
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

No users enable vhost right now; this patch just defines the functions.

The virtqueue is used similarly to the rx case: pasta fills the
descriptor table with the packet data it wants to send to the
namespace.  Each descriptor points to a buffer in memory, with an
address and a length.  The number of descriptors is again defined by
VHOST_NDESCS.

Afterwards it writes the descriptor index into the avail->ring[]
array, increments avail->idx to make it visible to the kernel, and
then kicks virtqueue 1's event fd.

When the kernel no longer needs a buffer it writes its id into
used_ring->ring[] and increments used_ring->idx.  Normally the kernel
also notifies pasta through virtqueue 1's call eventfd, but we don't
monitor that eventfd.  Instead, we check whether buffers can be reused
at the moment we produce, making the code simpler and more performant.

Like on the rx path, we assume descriptors are used in the same order
they were made available.  This is also consistent with the behavior
seen in QEMU's virtio-net implementation.
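
Enqueueing one frame of bufs_per_frame buffers then boils down to this
(simplified from tap_send_frames_vhost() below):

	head = vqs[1].vring_idx % VHOST_NDESCS;
	for (j = 0; j < bufs_per_frame; j++, vqs[1].vring_idx++) {
		struct vring_desc *desc =
			&vring_desc[1][vqs[1].vring_idx % VHOST_NDESCS];

		desc->addr = (uint64_t)iov[j].iov_base;
		desc->len = iov[j].iov_len;
		desc->flags = (j == bufs_per_frame - 1) ?
			      0 : htole16(VRING_DESC_F_NEXT);
	}
	avail->ring[le16toh(avail->idx) % VHOST_NDESCS] = htole16(head);

	smp_wmb();	/* fill descriptors before exposing the new idx */
	avail->idx = htole16(le16toh(avail->idx) + 1);
	vhost_kick(&vring_used_1.used, c->vq[1].kick_fd);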

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 arp.c     |  2 +-
 tap.c     | 84 +++++++++++++++++++++++++++++++++++++++++++++++--------
 tap.h     |  4 +--
 tcp.c     |  2 +-
 tcp_buf.c |  2 +-
 udp.c     |  2 +-
 6 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/arp.c b/arp.c
index fc482bb..ea786a0 100644
--- a/arp.c
+++ b/arp.c
@@ -80,7 +80,7 @@ int arp(const struct ctx *c, const struct pool *p)
 	memcpy(eh->h_dest,	eh->h_source,	sizeof(eh->h_dest));
 	memcpy(eh->h_source,	c->our_tap_mac,	sizeof(eh->h_source));
 
-	tap_send_single(c, eh, l2len);
+	tap_send_single(c, eh, l2len, false);
 
 	return 1;
 }
diff --git a/tap.c b/tap.c
index 5667fbe..7ccac86 100644
--- a/tap.c
+++ b/tap.c
@@ -121,11 +121,19 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
 static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
 			 "Number of vhost descs must be a power of two by standard");
 static struct {
+	/* Descriptor index we're using. This is not the same as avail idx in
+	 * split: this takes into account the chained descs */
+	uint16_t vring_idx;
+
 	/* Number of free descriptors */
 	uint16_t num_free;
 
 	/* Last used idx processed */
 	uint16_t last_used_idx;
+
+	/* Descriptors in use */
+	/* desc info: number of descriptors in the chain */
+	uint16_t ndescs[VHOST_NDESCS];
 } vqs[2];
 
 static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
@@ -176,7 +184,7 @@ unsigned long tap_l2_max_len(const struct ctx *c)
  * @data:	Packet buffer
  * @l2len:	Total L2 packet length
  */
-void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
+void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost)
 {
 	uint32_t vnet_len = htonl(l2len);
 	struct iovec iov[2];
@@ -192,7 +200,7 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 		iov[iovcnt].iov_len = l2len;
 		iovcnt++;
 
-		tap_send_frames(c, iov, iovcnt, 1);
+		tap_send_frames(c, iov, iovcnt, 1, vhost);
 		break;
 	case MODE_VU:
 		vu_send_single(c, data, l2len);
@@ -314,7 +322,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 	char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
 
 	memcpy(data, in, dlen);
-	tap_send_single(c, buf, dlen + (data - buf));
+	tap_send_single(c, buf, dlen + (data - buf), false);
 }
 
 /**
@@ -336,7 +344,7 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
 	memcpy(icmp4h, in, l4len);
 	csum_icmp4(icmp4h, icmp4h + 1, l4len - sizeof(*icmp4h));
 
-	tap_send_single(c, buf, l4len + ((char *)icmp4h - buf));
+	tap_send_single(c, buf, l4len + ((char *)icmp4h - buf), false);
 }
 
 /**
@@ -421,7 +429,7 @@ void tap_udp6_send(const struct ctx *c,
 	char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
 
 	memcpy(data, in, dlen);
-	tap_send_single(c, buf, dlen + (data - buf));
+	tap_send_single(c, buf, dlen + (data - buf), false);
 }
 
 /**
@@ -444,7 +452,7 @@ void tap_icmp6_send(const struct ctx *c,
 	memcpy(icmp6h, in, l4len);
 	csum_icmp6(icmp6h, src, dst, icmp6h + 1, l4len - sizeof(*icmp6h));
 
-	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
+	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf), false);
 }
 
 static void vhost_kick(struct vring_used *used, int kick_fd) {
@@ -459,8 +467,9 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
 		eventfd_write(kick_fd, 1);
 }
 
+
 /**
- * tap_send_frames_pasta() - Send multiple frames to the pasta tap
+ * tap_send_frames_vhost() - Send multiple frames to the namespace via vhost
  * @c:			Execution context
  * @iov:		Array of buffers
  * @bufs_per_frame:	Number of buffers (iovec entries) per frame
@@ -470,16 +479,68 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
  * @bufs_per_frame contiguous buffers representing a single frame.
  *
  * Return: number of frames successfully sent
+ */
+static size_t tap_send_frames_vhost(const struct ctx *c,
+				    const struct iovec *iov,
+				    size_t bufs_per_frame, size_t nframes)
+{
+	size_t i;
+
+	for (i = 0; i < nframes; i++) {
+		size_t j;
+
+		if (vqs[1].num_free < bufs_per_frame)
+			return i;
+
+		vring_avail_1.avail.ring[(vring_avail_1.avail.idx + i) % VHOST_NDESCS] = htole16(vqs[1].vring_idx) % VHOST_NDESCS;
+		vqs[1].ndescs[(vring_avail_1.avail.idx + i) % VHOST_NDESCS] = bufs_per_frame;
+		vqs[1].num_free -= bufs_per_frame;
+
+		for (j = 0; j < bufs_per_frame; ++j) {
+			struct vring_desc *desc = &vring_desc[1][vqs[1].vring_idx % VHOST_NDESCS];
+			const struct iovec *iov_i = &iov[i * bufs_per_frame + j];
+
+			desc->addr = (uint64_t)iov_i->iov_base;
+			desc->len = iov_i->iov_len;
+			desc->flags = (j == bufs_per_frame - 1) ? 0 : htole16(VRING_DESC_F_NEXT);
+			vqs[1].vring_idx++;
+		}
+	}
+
+	smp_wmb();
+	vring_avail_1.avail.idx = htole16(le16toh(vring_avail_1.avail.idx) + nframes);
+
+	vhost_kick(&vring_used_1.used, c->vq[1].kick_fd);
+
+	return nframes;
+}
+
+
+/**
+ * tap_send_frames_pasta() - Send multiple frames to the pasta tap
+ * @c:			Execution context
+ * @iov:		Array of buffers
+ * @bufs_per_frame:	Number of buffers (iovec entries) per frame
+ * @nframes:		Number of frames to send
+ * @vhost:             Use vhost-kernel or not
+ *
+ * @iov must have total length @bufs_per_frame * @nframes, with each set of
+ * @bufs_per_frame contiguous buffers representing a single frame.
+ *
+ * Return: number of frames successfully sent (or queued)
  *
  * #syscalls:pasta write
  */
 static size_t tap_send_frames_pasta(const struct ctx *c,
 				    const struct iovec *iov,
-				    size_t bufs_per_frame, size_t nframes)
+				    size_t bufs_per_frame, size_t nframes, bool vhost)
 {
 	size_t nbufs = bufs_per_frame * nframes;
 	size_t i;
 
+	if (vhost)
+		return tap_send_frames_vhost(c, iov, bufs_per_frame, nframes);
+
 	for (i = 0; i < nbufs; i += bufs_per_frame) {
 		ssize_t rc = writev(c->fd_tap, iov + i, bufs_per_frame);
 		size_t framelen = iov_size(iov + i, bufs_per_frame);
@@ -563,14 +624,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
  * @iov:		Array of buffers, each containing one frame (with L2 headers)
  * @bufs_per_frame:	Number of buffers (iovec entries) per frame
  * @nframes:		Number of frames to send
+ * @vhost:		Use vhost-kernel or not
  *
  * @iov must have total length @bufs_per_frame * @nframes, with each set of
  * @bufs_per_frame contiguous buffers representing a single frame.
  *
- * Return: number of frames actually sent
+ * Return: number of frames actually sent (or queued)
  */
 size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
-		       size_t bufs_per_frame, size_t nframes)
+		       size_t bufs_per_frame, size_t nframes, bool vhost)
 {
 	size_t m;
 
@@ -579,7 +641,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 
 	switch (c->mode) {
 	case MODE_PASTA:
-		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
+		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes, vhost);
 		break;
 	case MODE_PASST:
 		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
diff --git a/tap.h b/tap.h
index ff8cee5..e924dfb 100644
--- a/tap.h
+++ b/tap.h
@@ -111,9 +111,9 @@ void tap_udp6_send(const struct ctx *c,
 void tap_icmp6_send(const struct ctx *c,
 		    const struct in6_addr *src, const struct in6_addr *dst,
 		    const void *in, size_t l4len);
-void tap_send_single(const struct ctx *c, const void *data, size_t l2len);
+void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
 size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
-		       size_t bufs_per_frame, size_t nframes);
+		       size_t bufs_per_frame, size_t nframes, bool vhost);
 void eth_update_mac(struct ethhdr *eh,
 		    const unsigned char *eth_d, const unsigned char *eth_s);
 void tap_listen_handler(struct ctx *c, uint32_t events);
diff --git a/tcp.c b/tcp.c
index f43c1e2..05f5b4c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1935,7 +1935,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
 
 	tcp_update_csum(psum, rsth, &payload);
 	rst_l2len = ((char *)rsth - buf) + sizeof(*rsth);
-	tap_send_single(c, buf, rst_l2len);
+	tap_send_single(c, buf, rst_l2len, false);
 }
 
 /**
diff --git a/tcp_buf.c b/tcp_buf.c
index 6d79d67..242086d 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -141,7 +141,7 @@ void tcp_payload_flush(const struct ctx *c)
 	size_t m;
 
 	m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
-			    tcp_payload_used);
+			    tcp_payload_used, false);
 	if (m != tcp_payload_used) {
 		tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
 			       tcp_payload_used - m);
diff --git a/udp.c b/udp.c
index 65a52e0..d017d99 100644
--- a/udp.c
+++ b/udp.c
@@ -809,7 +809,7 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
 	for (i = 0; i < n; i++)
 		udp_tap_prepare(udp_mh_recv, i, toside, false);
 
-	tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
+	tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n, false);
 }
 
 /**
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 08/11] tap: add tap_free_old_xmit
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (6 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 07/11] tap: support tx through vhost Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-24  0:32   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 09/11] tcp: start conversion to circular buffer Eugenio Pérez
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

As pasta cannot modify sent TCP buffers until vhost-kernel is done
with them, we need a way to report to the caller which buffers can be
overwritten.

Let's start by following the same pattern as with tap write(2): wait
until pasta can overwrite the buffers.  We can add async cleaning on
top.
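
The intended usage pattern, sketched (tap_free_old_xmit() busy-waits
for now; a later patch in the series adds poll(2) on used_idx):

	sent = tap_send_frames(c, iov, bufs_per_frame, nframes, true);
	tap_free_old_xmit(sent);	/* returns once the kernel has used
					 * 'sent' frames, so their buffers
					 * can be overwritten */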

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 tap.c | 26 ++++++++++++++++++++++++++
 tap.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/tap.c b/tap.c
index 7ccac86..55357e3 100644
--- a/tap.c
+++ b/tap.c
@@ -128,6 +128,11 @@ static struct {
 	/* Number of free descriptors */
 	uint16_t num_free;
 
+	/* Last used_idx in the used ring.
+	 * Duplicate here allows to check for proper vhost usage, and avoid
+	 * false sharing between pasta and kernel. */
+	uint16_t shadow_used_idx;
+
 	/* Last used idx processed */
 	uint16_t last_used_idx;
 
@@ -467,6 +472,27 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
 		eventfd_write(kick_fd, 1);
 }
 
+/* n: number of sent frames to wait for and reclaim */
+void tap_free_old_xmit(size_t n)
+{
+	size_t r = 0;
+
+	while (r < n) {
+		uint16_t used_idx = vqs[1].last_used_idx;
+		if (vqs[1].shadow_used_idx == used_idx) {
+		       vqs[1].shadow_used_idx = le16toh(*(volatile uint16_t*)&vring_used_1.used.idx);
+
+		       if (vqs[1].shadow_used_idx == used_idx)
+			       continue;
+		}
+
+		/* assert in-order */
+		assert(vring_used_1.used.ring[used_idx % VHOST_NDESCS].id == vring_avail_1.avail.ring[used_idx % VHOST_NDESCS]);
+		vqs[1].num_free += vqs[1].ndescs[used_idx % VHOST_NDESCS];
+		vqs[1].last_used_idx++;
+		r++;
+	}
+}
 
 /**
  * tap_send_frames_vhost() - Send multiple frames to the pasta tap
diff --git a/tap.h b/tap.h
index e924dfb..7ca0fb0 100644
--- a/tap.h
+++ b/tap.h
@@ -112,6 +112,7 @@ void tap_icmp6_send(const struct ctx *c,
 		    const struct in6_addr *src, const struct in6_addr *dst,
 		    const void *in, size_t l4len);
 void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
+void tap_free_old_xmit(size_t n);
 size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 		       size_t bufs_per_frame, size_t nframes, bool vhost);
 void eth_update_mac(struct ethhdr *eh,
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 09/11] tcp: start conversion to circular buffer
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (7 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 08/11] tap: add tap_free_old_xmit Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-24  1:03   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 10/11] tap: add poll(2) to used_idx Eugenio Pérez
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

The vhost-kernel module is async by nature: the driver (pasta) places
a few buffers in the virtqueue and the device (vhost-kernel) trusts
that the driver will not modify them until it has used them.  This is
not possible with TCP at the moment, as tcp_buf trusts it can reuse
the buffers as soon as tcp_payload_flush() finishes.

To achieve async operation, let's make tcp_buf work with a circular
ring, so vhost can transmit while pasta is queueing more data.  When a
buffer is received from a TCP socket, the element is placed in the
ring and sock_head moves forward:
                             [][][][]
                             ^ ^
                             | |
                             | sock_head
                             |
                             tail
                             tap_head

When the data is sent to vhost through the tx queue, tap_head is moved
forward:
                             [][][][]
                             ^ ^
                             | |
                             | sock_head
                             | tap_head
                             |
			     tail

Finally, tail moves forward when vhost has used the tx buffers, so
tcp_payload (and all lower protocol buffers) can be reused.
                             [][][][]
                               ^
                               |
                               sock_head
                               tap_head
			       tail

If queueing to the vhost virtqueue fails, sock_head moves backwards.
The only possible error is that the queue is full, as virtio-net does
not report per-packet transmission success.

Start as simple as possible: this patch only implements the count
variables, so everything keeps working as before.  The circular
behavior will be added on top.
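
As a sketch, the invariants once the circular behavior lands (only the
counters exist in this patch; index names follow the diagrams above):

	/* Free-running counters; a slot is counter % TCP_FRAMES_MEM:
	 *
	 *   tail <= tap_head <= sock_head
	 *   sock_head - tail <= TCP_FRAMES_MEM	(ring never overruns)
	 *
	 * sock_head - tap_head: frames filled, not yet queued to vhost
	 * tap_head - tail:      frames queued, still in use by the kernel
	 */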

Throughput goes from ~16Gbit/s to ~13Gbit/s compared with write(2) to
the tap.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 tcp_buf.c | 63 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/tcp_buf.c b/tcp_buf.c
index 242086d..0437120 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -53,7 +53,12 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
 
 /* References tracking the owner connection of frames in the tap outqueue */
 static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
-static unsigned int tcp_payload_used;
+static unsigned int tcp_payload_sock_used, tcp_payload_tap_used;
+
+static void tcp_payload_sock_produce(size_t n)
+{
+	tcp_payload_sock_used += n;
+}
 
 static struct iovec	tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
 
@@ -132,6 +137,16 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
 	}
 }
 
+static void tcp_buf_free_old_tap_xmit(void)
+{
+	while (tcp_payload_tap_used) {
+		tap_free_old_xmit(tcp_payload_tap_used);
+
+		tcp_payload_tap_used = 0;
+		tcp_payload_sock_used = 0;
+	}
+}
+
 /**
  * tcp_payload_flush() - Send out buffers for segments with data or flags
  * @c:		Execution context
@@ -141,12 +156,13 @@ void tcp_payload_flush(const struct ctx *c)
 	size_t m;
 
 	m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
-			    tcp_payload_used, false);
-	if (m != tcp_payload_used) {
+			    tcp_payload_sock_used, true);
+	if (m != tcp_payload_sock_used) {
 		tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
-			       tcp_payload_used - m);
+			       tcp_payload_sock_used - m);
 	}
-	tcp_payload_used = 0;
+	tcp_payload_tap_used += m;
+	tcp_buf_free_old_tap_xmit();
 }
 
 /**
@@ -195,12 +211,12 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	uint32_t seq;
 	int ret;
 
-	iov = tcp_l2_iov[tcp_payload_used];
+	iov = tcp_l2_iov[tcp_payload_sock_used];
 	if (CONN_V4(conn)) {
-		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
+		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
 		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
 	} else {
-		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
+		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
 		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
 	}
 
@@ -211,13 +227,14 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (ret <= 0)
 		return ret;
 
-	tcp_payload_used++;
+	tcp_payload_sock_produce(1);
 	l4len = optlen + sizeof(struct tcphdr);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 	tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
 
 	if (flags & DUP_ACK) {
-		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used++];
+		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_sock_used];
+		tcp_payload_sock_produce(1);
 
 		memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
 		       iov[TCP_IOV_TAP].iov_len);
@@ -228,8 +245,9 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		dup_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 	}
 
-	if (tcp_payload_used > TCP_FRAMES_MEM - 2)
+	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 2) {
 		tcp_payload_flush(c);
+	}
 
 	return 0;
 }
@@ -251,19 +269,19 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	struct iovec *iov;
 
 	conn->seq_to_tap = seq + dlen;
-	tcp_frame_conns[tcp_payload_used] = conn;
-	iov = tcp_l2_iov[tcp_payload_used];
+	tcp_frame_conns[tcp_payload_sock_used] = conn;
+	iov = tcp_l2_iov[tcp_payload_sock_used];
 	if (CONN_V4(conn)) {
 		if (no_csum) {
-			struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
+			struct iovec *iov_prev = tcp_l2_iov[tcp_payload_sock_used - 1];
 			struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
 
 			check = &iph->check;
 		}
-		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
+		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
 		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
 	} else if (CONN_V6(conn)) {
-		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
+		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
 		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
 	}
 	payload = iov[TCP_IOV_PAYLOAD].iov_base;
@@ -274,8 +292,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	payload->th.psh = push;
 	iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
 	tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
-	if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
+	tcp_payload_sock_produce(1);
+	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 1) {
 		tcp_payload_flush(c);
+	}
 }
 
 /**
@@ -341,15 +361,12 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		mh_sock.msg_iovlen = fill_bufs;
 	}
 
-	if (tcp_payload_used + fill_bufs > TCP_FRAMES_MEM) {
+	if (tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM) {
 		tcp_payload_flush(c);
-
-		/* Silence Coverity CWE-125 false positive */
-		tcp_payload_used = 0;
 	}
 
 	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
-		iov->iov_base = &tcp_payload[tcp_payload_used + i].data;
+		iov->iov_base = &tcp_payload[tcp_payload_sock_used + i].data;
 		iov->iov_len = mss;
 	}
 	if (iov_rem)
@@ -407,7 +424,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	dlen = mss;
 	seq = conn->seq_to_tap;
 	for (i = 0; i < send_bufs; i++) {
-		int no_csum = i && i != send_bufs - 1 && tcp_payload_used;
+		int no_csum = i && i != send_bufs - 1 && tcp_payload_sock_used;
 		bool push = false;
 
 		if (i == send_bufs - 1) {
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 10/11] tap: add poll(2) to used_idx
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (8 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 09/11] tcp: start conversion to circular buffer Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-24  1:20   ` David Gibson
  2025-07-09 17:47 ` [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer Eugenio Pérez
  2025-07-10  9:46 ` [RFC v2 00/11] Add vhost-net kernel support Eugenio Perez Martin
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

Namespace rx goes from ~13Gbit/s to ~11.5Gbit/s.

TODO: Maybe we can reuse epoll for this, avoiding the need to
introduce a new syscall.
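
For reference, the notification dance that used_idx() below implements,
as a stand-alone sketch (wait_used() and its parameters are made up for
illustration; le16toh() conversions and memory barriers are elided):

  #include <assert.h>
  #include <poll.h>
  #include <stdint.h>
  #include <linux/virtio_ring.h>

  static uint16_t wait_used(volatile uint16_t *used_idx, uint16_t last,
  			    uint16_t *avail_flags, int call_fd)
  {
  	struct pollfd fds = { .fd = call_fd, .events = POLLIN };

  	if (*used_idx != last)		/* fast path: already progressed */
  		return *used_idx;

  	/* Ask vhost to signal the call eventfd again... */
  	*avail_flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
  	/* ...and re-check to close the race with an in-flight completion */
  	if (*used_idx != last)
  		return *used_idx;

  	assert(poll(&fds, 1, -1) > 0);	/* sleep until vhost kicks */
  	*avail_flags |= VRING_AVAIL_F_NO_INTERRUPT;
  	return *used_idx;
  }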

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 tap.c     | 59 +++++++++++++++++++++++++++++++++++++++++++++----------
 tap.h     |  2 +-
 tcp_buf.c |  6 +++---
 3 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/tap.c b/tap.c
index 55357e3..93a8c12 100644
--- a/tap.c
+++ b/tap.c
@@ -19,6 +19,7 @@
 #include <stdio.h>
 #include <errno.h>
 #include <limits.h>
+#include <poll.h>
 #include <string.h>
 #include <net/ethernet.h>
 #include <net/if.h>
@@ -120,7 +121,7 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
 #define VHOST_NDESCS (PKT_BUF_BYTES / 65520)
 static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
 			 "Number of vhost descs must be a power of two by standard");
-static struct {
+static struct vhost_virtqueue {
 	/* Descriptor index we're using. This is not the same as avail idx in
 	 * split: this takes into account the chained descs */
 	uint16_t vring_idx;
@@ -472,26 +473,63 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
 		eventfd_write(kick_fd, 1);
 }

+/*
+ * #syscalls:pasta read poll
+ */
+static uint16_t used_idx(struct vhost_virtqueue *vq,
+			 struct vring_avail *avail,
+			 const struct vring_used *used, int pollfd)
+{
+	struct pollfd fds = { .fd = pollfd, .events = POLLIN };
+	int r;
+
+	if (vq->shadow_used_idx == vq->last_used_idx)
+		vq->shadow_used_idx = le16toh(used->idx);
+
+	if (vq->shadow_used_idx != vq->last_used_idx || pollfd < 0)
+		return vq->shadow_used_idx;
+
+	avail->flags &= ~htole16(VRING_AVAIL_F_NO_INTERRUPT);
+	/* trusting the syscall to act as a write barrier (smp_wmb()) */
+	r = read(pollfd, (uint64_t[]){0}, sizeof(uint64_t));
+	assert((r < 0 && errno == EAGAIN) || r == 8);
+
+	/* Another opportunity before syscalls */
+	vq->shadow_used_idx = le16toh(used->idx);
+	if (vq->shadow_used_idx != vq->last_used_idx) {
+		return vq->shadow_used_idx;
+	}
+
+	r = poll(&fds, 1, -1);
+	assert(0 < r);
+	avail->flags |= htole16(VRING_AVAIL_F_NO_INTERRUPT);
+	vq->shadow_used_idx = le16toh(used->idx);
+	return vq->shadow_used_idx;
+}
+
 /* n = target */
-void tap_free_old_xmit(size_t n)
+size_t tap_free_old_xmit(const struct ctx *c, size_t n)
 {
 	size_t r = 0;
+	int pollfd = (n == (size_t)-1) ? -1 : c->vq[1].call_fd;

 	while (r < n) {
-		uint16_t used_idx = vqs[1].last_used_idx;
-		if (vqs[1].shadow_used_idx == used_idx) {
-		       vqs[1].shadow_used_idx = le16toh(*(volatile uint16_t*)&vring_used_1.used.idx);
-
-		       if (vqs[1].shadow_used_idx == used_idx)
-			       continue;
+		uint16_t last_used = vqs[1].last_used_idx;
+		if (used_idx(&vqs[1], &vring_avail_1.avail, &vring_used_1.used, pollfd) == last_used) {
+			assert(pollfd == -1);
+			return r;
 		}

+		/* Only get used array entries after they have been exposed by vhost. */
+		smp_rmb();
 		/* assert in-order */
-		assert(vring_used_1.used.ring[used_idx % VHOST_NDESCS].id == vring_avail_1.avail.ring[used_idx % VHOST_NDESCS]);
-		vqs[1].num_free += vqs[1].ndescs[used_idx % VHOST_NDESCS];
+		assert(vring_used_1.used.ring[last_used % VHOST_NDESCS].id == vring_avail_1.avail.ring[last_used % VHOST_NDESCS]);
+		vqs[1].num_free += vqs[1].ndescs[last_used % VHOST_NDESCS];
 		vqs[1].last_used_idx++;
 		r++;
 	}
+
+	return r;
 }

 /**
@@ -1687,6 +1725,7 @@ static int tap_ns_tun(void *arg)
 			if (rc < 0)
 				die_perror("Failed to add call eventfd to epoll");
 		}
+		fprintf(stderr, "[eperezma %s:%d][i=%d][call_fd=%d]\n", __func__, __LINE__, i, file.fd);
 		c->vq[i].call_fd = file.fd;

 		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
diff --git a/tap.h b/tap.h
index 7ca0fb0..7004116 100644
--- a/tap.h
+++ b/tap.h
@@ -112,7 +112,7 @@ void tap_icmp6_send(const struct ctx *c,
 		    const struct in6_addr *src, const struct in6_addr *dst,
 		    const void *in, size_t l4len);
 void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
-void tap_free_old_xmit(size_t n);
+size_t tap_free_old_xmit(const struct ctx *c, size_t n);
 size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 		       size_t bufs_per_frame, size_t nframes, bool vhost);
 void eth_update_mac(struct ethhdr *eh,
diff --git a/tcp_buf.c b/tcp_buf.c
index 0437120..f74d22d 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -137,10 +137,10 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
 	}
 }

-static void tcp_buf_free_old_tap_xmit(void)
+static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
 {
 	while (tcp_payload_tap_used) {
-		tap_free_old_xmit(tcp_payload_tap_used);
+		tap_free_old_xmit(c, tcp_payload_tap_used);

 		tcp_payload_tap_used = 0;
 		tcp_payload_sock_used = 0;
@@ -162,7 +162,7 @@ void tcp_payload_flush(const struct ctx *c)
 			       tcp_payload_sock_used - m);
 	}
 	tcp_payload_tap_used += m;
-	tcp_buf_free_old_tap_xmit();
+	tcp_buf_free_old_tap_xmit(c);
 }

 /**
--
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (9 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 10/11] tap: add poll(2) to used_idx Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
  2025-07-24  1:33   ` David Gibson
  2025-07-10  9:46 ` [RFC v2 00/11] Add vhost-net kernel support Eugenio Perez Martin
  11 siblings, 1 reply; 26+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

Now both tcp_sock and tap use the circular buffer as intended.

Very lightly tested.  Especially the paths checked before producing,
like the ring being full or almost full
(tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM).

Processing the tx buffers in a circular buffer makes namespace rx go
from ~11.5Gbit/s to ~17.26Gbit/s.

TODO: Increase the tx queue length, as we spend a lot of descriptors
on each request.  Ideally, tx size should be at least
bufs_per_frame*TCP_FRAMES_MEM, but we may get more performance with
bigger queues.
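
As a rough illustration of that lower bound (the numbers here are
assumptions for the example, not taken from the tree):

  #define TCP_NUM_IOVS		4	/* iovs (descriptors) per frame */
  #define TCP_FRAMES_MEM	128	/* frames in a full batch */
  #define VHOST_TX_NDESCS_MIN	(TCP_NUM_IOVS * TCP_FRAMES_MEM)	/* 512 */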

TODO: Sometimes we call tcp_buf_free_old_tap_xmit twice: once to free
at least N used tx buffers, and again in tcp_payload_flush.  Maybe we
can optimize it.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 tcp_buf.c | 130 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 106 insertions(+), 24 deletions(-)

diff --git a/tcp_buf.c b/tcp_buf.c
index f74d22d..326af79 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -53,13 +53,66 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
 
 /* References tracking the owner connection of frames in the tap outqueue */
 static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
-static unsigned int tcp_payload_sock_used, tcp_payload_tap_used;
+
+/*
+ * sock_head: Head of buffers available for writing. tcp_data_to_tap moves it
+ * forward, but errors queueing to vhost can move it backwards to tap_head
+ * again.
+ *
+ * tap_head: Head of buffers that have been sent to vhost. flush moves this
+ * forward.
+ *
+ * tail: Chasing index. Increments when vhost uses buffers.
+ *
+ * _used: Independent variables to tell between full and empty.
+ */
+static unsigned int tcp_payload_sock_head, tcp_payload_tap_head, tcp_payload_tail, tcp_payload_sock_used, tcp_payload_tap_used;
+#define IS_POW2(y) (((y) > 0) && !((y) & ((y) - 1)))
+static_assert(ARRAY_SIZE(tcp_payload) == TCP_FRAMES_MEM, "TCP_FRAMES_MEM is not the size of tcp_payload anymore");
+static_assert(IS_POW2(TCP_FRAMES_MEM), "TCP_FRAMES_MEM must be a power of two");
+
+static size_t tcp_payload_cnt_to_end(size_t head, size_t tail)
+{
+	size_t end = ARRAY_SIZE(tcp_payload) - tail;
+	size_t n = (head + end) % ARRAY_SIZE(tcp_payload);
+
+	assert(head != tail);
+	return MIN(n, end);
+}
+
+/* Count the number of items that have been written from sock to the
+ * circular buffer and can be sent to tap.
+ */
+static size_t tcp_payload_tap_cnt(void)
+{
+	return tcp_payload_sock_used - tcp_payload_tap_used;
+}
 
 static void tcp_payload_sock_produce(size_t n)
 {
+	tcp_payload_sock_head = (tcp_payload_sock_head + n) % ARRAY_SIZE(tcp_payload);
 	tcp_payload_sock_used += n;
 }
 
+/* Count the number of consecutive items that have been written from sock to
+ * the circular buffer and can be sent to tap without having to wrap back to the
+ * beginning of the buffer.
+ */
+static size_t tcp_payload_tap_cnt_to_end(void)
+{
+	if (tcp_payload_sock_head == tcp_payload_tap_head) {
+		/* empty? */
+		if (tcp_payload_sock_used - tcp_payload_tap_used == 0)
+			return 0;
+
+		/* full */
+		return ARRAY_SIZE(tcp_payload) - tcp_payload_tap_head;
+	}
+
+	return tcp_payload_cnt_to_end(tcp_payload_sock_head,
+			              tcp_payload_tap_head);
+}
+
 static struct iovec	tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
 
 /**
@@ -137,14 +190,13 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
 	}
 }
 
-static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
+static void tcp_buf_free_old_tap_xmit(const struct ctx *c, size_t target)
 {
-	while (tcp_payload_tap_used) {
-		tap_free_old_xmit(c, tcp_payload_tap_used);
+       size_t n = tap_free_old_xmit(c, target);
 
-		tcp_payload_tap_used = 0;
-		tcp_payload_sock_used = 0;
-	}
+       tcp_payload_tail = (tcp_payload_tail + n) & (ARRAY_SIZE(tcp_payload) - 1);
+       tcp_payload_tap_used -= n;
+       tcp_payload_sock_used -= n;
 }
 
 /**
@@ -153,16 +205,33 @@ static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
  */
 void tcp_payload_flush(const struct ctx *c)
 {
-	size_t m;
+	size_t m, n = tcp_payload_tap_cnt_to_end();
+	struct iovec *head = &tcp_l2_iov[tcp_payload_tap_head][0];
 
-	m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
-			    tcp_payload_sock_used, true);
-	if (m != tcp_payload_sock_used) {
-		tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
-			       tcp_payload_sock_used - m);
-	}
+	tcp_buf_free_old_tap_xmit(c, (size_t)-1);
+	m = tap_send_frames(c, head, TCP_NUM_IOVS, n, true);
 	tcp_payload_tap_used += m;
-	tcp_buf_free_old_tap_xmit(c);
+	tcp_payload_tap_head = (tcp_payload_tap_head + m) %
+		               ARRAY_SIZE(tcp_payload);
+
+	if (m != n) {
+		n = tcp_payload_tap_cnt_to_end();
+
+		tcp_revert_seq(c, &tcp_frame_conns[tcp_payload_tap_head],
+			       &tcp_l2_iov[tcp_payload_tap_head], n);
+		/*
+		 * circular buffer wrap case.
+		 * TODO: Maybe it's better to adapt tcp_revert_seq.
+		 */
+		tcp_revert_seq(c, &tcp_frame_conns[0], &tcp_l2_iov[0],
+			       tcp_payload_tap_cnt() - n);
+
+		tcp_payload_sock_head = tcp_payload_tap_head;
+		tcp_payload_sock_used = tcp_payload_tap_used;
+	} else if (tcp_payload_tap_cnt_to_end()) {
+		/* circular buffer wrap case */
+		tcp_payload_flush(c);
+	}
 }
 
 /**
@@ -209,14 +278,15 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	size_t optlen;
 	size_t l4len;
 	uint32_t seq;
+	unsigned int i = tcp_payload_sock_head;
 	int ret;
 
-	iov = tcp_l2_iov[tcp_payload_sock_used];
+	iov = tcp_l2_iov[i];
 	if (CONN_V4(conn)) {
-		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
+		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[i]);
 		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
 	} else {
-		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
+		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[i]);
 		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
 	}
 
@@ -228,13 +298,15 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		return ret;
 
 	tcp_payload_sock_produce(1);
+	i = tcp_payload_sock_head;
 	l4len = optlen + sizeof(struct tcphdr);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 	tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
 
 	if (flags & DUP_ACK) {
-		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_sock_used];
+		struct iovec *dup_iov = tcp_l2_iov[i];
 		tcp_payload_sock_produce(1);
+		i = tcp_payload_sock_head;
 
 		memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
 		       iov[TCP_IOV_TAP].iov_len);
@@ -246,7 +318,10 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	}
 
 	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 2) {
+		tcp_buf_free_old_tap_xmit(c, 2);
 		tcp_payload_flush(c);
+		/* TODO how to fix this? original code didn't check for success either */
+		assert(tcp_payload_sock_used <= TCP_FRAMES_MEM - 2);
 	}
 
 	return 0;
@@ -269,16 +344,17 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	struct iovec *iov;
 
 	conn->seq_to_tap = seq + dlen;
-	tcp_frame_conns[tcp_payload_sock_used] = conn;
-	iov = tcp_l2_iov[tcp_payload_sock_used];
+	tcp_frame_conns[tcp_payload_sock_head] = conn;
+	iov = tcp_l2_iov[tcp_payload_sock_head];
 	if (CONN_V4(conn)) {
 		if (no_csum) {
-			struct iovec *iov_prev = tcp_l2_iov[tcp_payload_sock_used - 1];
+			unsigned prev = (tcp_payload_sock_head - 1) % TCP_FRAMES_MEM;
+			struct iovec *iov_prev = tcp_l2_iov[prev];
 			struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
 
 			check = &iph->check;
 		}
-		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
+		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_head]);
 		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
 	} else if (CONN_V6(conn)) {
 		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
@@ -294,8 +370,11 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 	tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
 	tcp_payload_sock_produce(1);
 	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 1) {
+		tcp_buf_free_old_tap_xmit(c, 1);
 		tcp_payload_flush(c);
+		assert(tcp_payload_sock_used <= TCP_FRAMES_MEM - 1);
 	}
+
 }
 
 /**
@@ -362,11 +441,14 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM) {
+		tcp_buf_free_old_tap_xmit(c, fill_bufs);
 		tcp_payload_flush(c);
+		/* TODO how to report this to upper layers? */
+		assert(tcp_payload_sock_used + fill_bufs <= TCP_FRAMES_MEM);
 	}
 
 	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
-		iov->iov_base = &tcp_payload[tcp_payload_sock_used + i].data;
+		iov->iov_base = &tcp_payload[(tcp_payload_sock_head + i) % TCP_FRAMES_MEM].data;
 		iov->iov_len = mss;
 	}
 	if (iov_rem)
-- 
2.50.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC v2 00/11] Add vhost-net kernel support
  2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
                   ` (10 preceding siblings ...)
  2025-07-09 17:47 ` [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer Eugenio Pérez
@ 2025-07-10  9:46 ` Eugenio Perez Martin
  11 siblings, 0 replies; 26+ messages in thread
From: Eugenio Perez Martin @ 2025-07-10  9:46 UTC (permalink / raw)
  To: passt-dev; +Cc: jasowang

On Wed, Jul 9, 2025 at 7:48 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> vhost-net is a kernel device that allows to read packets from a tap
> device using virtio queues instead of regular read(2) and write(2).
> This enables a more eficient packet processing, as the memory can
> be written directly by the kernel to the userspace and back, instead
> of wasting bandwith on copies, and it enables to batch many packets
> in a single notification (through eventfds) both tx and rx.
>
> Namespace tx performance improves from ~26.3Gbit/s to ~36.9Gbit/s.
>
> Namespace rx performance improves from ~16BGbit/s to ~17.26Gbit/s.
>

For reference, iperf without namespaces and veth both reaches
~47.8Gbit/s in both directions.

> RFC: At this moment only these are supported:
> * Receive l2 packets from the vhost kernel to pasta
> * Send l4 tcp socket received data through vhost-kernel to namespace.
>
> TODO: Add vhost zerocopy in the tests, and compare with veth.
> TODO: Implement at least UDP tx.  Or maybe we want UDP to be write(2) because
> of latency?
> TODO: Check style for variable declarations in for loops and use of curly
> brackets as long as they wrap more than a line.
> TODO: kerneldoc style function header comments
>
> --
> v2: Add TCP tx, and integrated some comments from the previous series. Please
>     check each patch message for details.
>
> Eugenio Pérez (11):
>   tap: implement vhost_call_cb
>   tap: add die() on vhost error
>   Replace tx tap hdr with virtio_nethdr_mrg_rxbuf
>   tcp: export memory regions to vhost
>   virtio: Fill .next in tx queue
>   tap: move static iov_sock to tcp_buf_data_from_sock
>   tap: support tx through vhost
>   tap: add tap_free_old_xmit
>   tcp: start conversion to circular buffer
>   Add poll(2) to used_idx
>   tcp_buf: adding TCP tx circular buffer
>
>  arp.c        |   2 +-
>  epoll_type.h |   4 +
>  passt.c      |  12 +-
>  passt.h      |  11 +-
>  tap.c        | 489 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  tap.h        |  13 +-
>  tcp.c        |   2 +-
>  tcp_buf.c    | 179 +++++++++++++++----
>  tcp_buf.h    |  19 ++
>  udp.c        |   2 +-
>  10 files changed, 675 insertions(+), 58 deletions(-)
>
> --
> 2.50.0
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 01/11] tap: implement vhost_call_cb
  2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
@ 2025-07-23  6:56   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-23  6:56 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 24193 bytes --]

On Wed, Jul 09, 2025 at 07:47:38PM +0200, Eugenio Pérez wrote:
> Implement the rx side of the vhost-kernel, where the namespace send
> packets through its tap device and the kernel passes them to
> pasta using a virtqueue.  The virtqueue is build using a
> virtual ring (vring), which includes:
> * A descriptor table (buffer info)
> * An available ring (buffers ready for the kernel)
> * A used ring (buffers that the kernel has filled)
> 
> The descriptor table holds an array of buffers defined by address and
> length. The kernel writes the packets transmitted by the namespace into
> these buffers.  The number of descriptors (vq size) is set by
> VHOST_NDESCS.  Pasta fills this table using pkt_buf, splitting it
> evenly across all descriptors.  This table is read-only for the kernel.
> 
> The available ring is where pasta marks which buffers the kernel can
> use.  It's read only for the kernel.  It includes a ring[] array with
> the descriptor indexes and an avail->idx index. Pasta increments
> avail->idx when a new buffer is added, modulo the size of the
> virtqueue.  As pasta writes the rx buffers sequentially, ring[] is
> always [0, 1, 2...] and only avail->idx is incremented when new buffers
> are available for the kernel. avail->idx can be incremented by more
> than one at a time.
> 
> Pasta also notifies the kernel of new available buffers by writing to
> the kick eventfd.
> 
> Once the kernel has written a frame in a descriptor it writes its index
> into used_ring->ring[] and increments the used_ring->idx accordly.
> Like the avail idx the kernel can increase it by more than one.  Pasta
> gets a notification in the call eventfd, so we add it into the epoll ctx.
> 
> Pasta assumes buffers are used in order. QEMU also assumes it in the
> virtio-net migration code so it is safe.
> 
> Now, vhost-kernel is designed to read the virtqueues and the buffers as
> *guest* physical addresses (GPA), not process virtual addresses (HVA).
> The way QEMU tells the translations is through the memory regions.
> Since we don't have GPAs, let's just send the memory regions as a 1:1
> translations of the HVA.
> 
> TODO: Evaluate if we can reuse the tap fd code instead of making a new
> epoll event type.

I'm not sure this is particularly desirable, unless it obviously
removes a bunch of duplicated code.  As a rule I'd prefer to use
different epoll event types rather than a single epoll type that then
needs to consult additional information to determine how to handle it.

> TODO: Split a new file for vhost (Stefano)
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> RFC v2:
> * Need to integrate "now" parameter in tap_add_packet and replace
> TAP_MSGS to TAP_MSGS{4,6}.
> * Actually refill rx queue at the end of operation.
> * Explain virtqueues and memory regions theory of operation.
> * Extrack vhost_kick so it can be reused by tx.
> * Add macro for VHOST regions.
> * Only register call_cb in rx queue, as tx calls are handled just if
>   needed.
> * Renamed vhost_call_cb to tap_vhost_input (David)
> * Move the inuse and last_used_idx to a writable struct from vhost tx
>   function. Context is readonly for vhost tx code path.
> * Changed from inuse to num_free tracking, more aligned with kernel drv
> * Use always the same tap_pool instead of having the two splitted for v4
>   and v6.
> * Space between (struct vhost_memory_region) {.XXX = ...}. (Stefano)
> * Expand comments about memory region size (David)
> * Add some TODOs.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  epoll_type.h |   2 +
>  passt.c      |   7 +-
>  passt.h      |  10 +-
>  tap.c        | 311 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  tap.h        |   8 ++
>  5 files changed, 335 insertions(+), 3 deletions(-)
> 
> diff --git a/epoll_type.h b/epoll_type.h
> index 12ac59b..0371c14 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -44,6 +44,8 @@ enum epoll_type {
>  	EPOLL_TYPE_REPAIR_LISTEN,
>  	/* TCP_REPAIR helper socket */
>  	EPOLL_TYPE_REPAIR,
> +	/* vhost-kernel call socket */
> +	EPOLL_TYPE_VHOST_CALL,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/passt.c b/passt.c
> index 388d10f..0f2659c 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -79,6 +79,7 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
>  	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
> +	[EPOLL_TYPE_VHOST_CALL]		= "vhost-kernel call socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -310,7 +311,8 @@ loop:
>  
>  		switch (ref.type) {
>  		case EPOLL_TYPE_TAP_PASTA:
> -			tap_handler_pasta(&c, eventmask, &now);
> +			// TODO: Find a better way, maybe reuse the fd.
> +			// tap_handler_pasta(&c, eventmask, &now);

This change seems bogus.  We still need this if we want to be able to
fall back to tap without vhost, which I think we want to be able to
do.  I'm also not clear why you need it - if you don't want these
events, can't you just not register them (at least not with this event
type) in epoll_ctl?

>  			break;
>  		case EPOLL_TYPE_TAP_PASST:
>  			tap_handler_passt(&c, eventmask, &now);
> @@ -357,6 +359,9 @@ loop:
>  		case EPOLL_TYPE_REPAIR:
>  			repair_handler(&c, eventmask);
>  			break;
> +		case EPOLL_TYPE_VHOST_CALL:
> +			tap_vhost_input(&c, ref, &now);

I'd suggest calling this tap_handler_vhost() for consistency with
tap_handler_pasta() and others.  I've also found it usually ends up a
bit cleaner to pass the relevant individual members of ref, rather
than the whole thing.

> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 8693794..7bb86c4 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -45,7 +45,7 @@ union epoll_ref;
>   * @icmp:	ICMP-specific reference part
>   * @data:	Data handled by protocol handlers
>   * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
> - * @queue:	vhost-user queue index for this fd
> + * @queue:	vhost queue index for this fd
>   * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
>   */
>  union epoll_ref {
> @@ -269,11 +269,14 @@ struct ctx {
>  	int fd_tap;
>  	int fd_repair_listen;
>  	int fd_repair;
> +	/* TODO document all added fields */
> +	int fd_vhost;
>  	unsigned char our_tap_mac[ETH_ALEN];
>  	unsigned char guest_mac[ETH_ALEN];
>  	uint16_t mtu;
>  
>  	uint64_t hash_secret[2];
> +	uint64_t virtio_features;
>  
>  	int ifi4;
>  	struct ip4_ctx ip4;
> @@ -297,6 +300,11 @@ struct ctx {
>  	int no_icmp;
>  	struct icmp_ctx icmp;
>  
> +	struct {
> +		int kick_fd;
> +		int call_fd;
> +	} vq[2];

Maybe not in scope for this series, but this is reminding me that we
should really split out the tap-backend specific fields into their own
union of substructures to make it clearer what fields are relevant in
which configurations.
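
Something like this shape, say (just a sketch, not a worked proposal;
the field names are made up):

  struct ctx {
  	/* ...common fields... */
  	union {
  		struct {			/* pasta with vhost-net */
  			int fd_vhost;
  			struct {
  				int kick_fd;
  				int call_fd;
  			} vq[2];
  		} vhost;
  		struct {			/* passt with vhost-user */
  			int fd_guest;
  		} vu;
  	} backend;
  };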


>  	int no_dns;
>  	int no_dns_search;
>  	int no_dhcp_dns;
> diff --git a/tap.c b/tap.c
> index 6db5d88..e4a3822 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -31,6 +31,7 @@
>  #include <sys/types.h>
>  #include <sys/stat.h>
>  #include <fcntl.h>
> +#include <sys/eventfd.h>
>  #include <sys/uio.h>
>  #include <stdbool.h>
>  #include <stdlib.h>
> @@ -101,6 +102,51 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
>  #define TAP_SEQS		128 /* Different L4 tuples in one batch */
>  #define FRAGMENT_MSG_RATE	10  /* # seconds between fragment warnings */
>  
> +#define VHOST_VIRTIO         0xAF
> +#define VHOST_GET_FEATURES   _IOR(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_SET_FEATURES   _IOW(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_SET_OWNER	     _IO(VHOST_VIRTIO, 0x01)
> +#define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
> +#define VHOST_SET_VRING_NUM  _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> +#define VHOST_SET_VRING_ERR  _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> +#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
> +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)

This stuff probably belongs in linux_dep.h (if you can't get it from a
system or kernel header).

> +#define VHOST_NDESCS (PKT_BUF_BYTES / 65520)

I suspect this is wrong.  For starters we should definitely be using a
define, not open-coded 65520.  But more importantly, 65520 is the
(default) MTU - the maximum size *of the IP packet*, excluding L2
headers.  I'm assuming these buffers will also need to contain the L2
header, so will need to be larger.  Looking at the later code, it
looks like these buffers also include the virtio_net_hdr_mrg_rxbuf
pseudo-physical header, so will need to be bigger again.

I think you'll want a new define for this, which would be
L2_MAX_LEN_PASTA plus whatever you need on top of the Ethernet header
(virtion_net_hdr_mrg_rxbuf, AFAICT).  For now, I'm going to call this
L1_MAX_LEN_VHOST, which is not a great name, but I want to refer to it
later in this mail.  You may also need/want to round it up to
4/8/whatever bytes to avoid poorly aligned buffers.
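
Concretely, something along these lines (assuming ROUND_UP() from
util.h and L2_MAX_LEN_PASTA from tap.h; the name and the 8-byte
alignment are placeholders):

  #include <linux/virtio_net.h>

  #define L1_MAX_LEN_VHOST \
  	ROUND_UP(sizeof(struct virtio_net_hdr_mrg_rxbuf) + L2_MAX_LEN_PASTA, 8)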

> +static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
> +			 "Number of vhost descs must be a power of two by standard");

Relying on this calculation coming out to a power of 2 seems kind of
fragile.  Maybe just round down to a power of 2 instead?  Although I
guess that could mean a bunch of wasted space.

To avoid wasting space if this doesn't come neatly to a power of 2,
would it be safe to round _up_ to a power of 2, as long as we never,
ever put an index beyond what we actually have space for into the
available queue?
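
i.e. something like this (sketch only; safe as long as indexes beyond
the real buffer count never land in the avail ring):

  static inline unsigned roundup_pow2(unsigned n)
  {
  	unsigned r = 1;

  	while (r < n)
  		r <<= 1;
  	return r;
  }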

> +static struct {
> +	/* Number of free descriptors */
> +	uint16_t num_free;
> +
> +	/* Last used idx processed */
> +	uint16_t last_used_idx;
> +} vqs[2];
> +
> +static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
> +static union {
> +	struct vring_avail avail;
> +	char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])];

The purpose of this idiom of a union with a char buffer based on
offsetof isn't obvious to me yet.

> +} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __attribute__((aligned(PAGE_SIZE)));
> +static union {
> +	struct vring_used used;
> +	char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
> +} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));

Maybe 2 entry arrays, rather than *_0 and *_1?

> +
> +/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */
> +#define N_VHOST_REGIONS 6
> +union {
> +	struct vhost_memory mem;
> +	char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])];
> +} vhost_memory = {
> +	.mem = {
> +		.nregions = N_VHOST_REGIONS,
> +	},
> +};
> +
>  /**
>   * tap_l2_max_len() - Maximum frame size (including L2 header) for current mode
>   * @c:		Execution context
> @@ -399,6 +445,18 @@ void tap_icmp6_send(const struct ctx *c,
>  	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
>  }
>  
> +static void vhost_kick(struct vring_used *used, int kick_fd) {

passt style puts the opening { of a function on the next line.

> +	/* We need to expose available array entries before checking avail
> +	 * event.
> +	 *
> +	 * TODO: Does eventfd_write already do this?
> +	 */
> +	smp_mb();
> +
> +	if (!(used->flags & VRING_USED_F_NO_NOTIFY))
> +		eventfd_write(kick_fd, 1);
> +}
> +
>  /**
>   * tap_send_frames_pasta() - Send multiple frames to the pasta tap
>   * @c:			Execution context
> @@ -1386,6 +1444,89 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  	tap_start_connection(c);
>  }
>
> +static void *virtqueue_get_rx_buf(unsigned qid, unsigned *len)

None of these functions have descriptive comment blocks yet.  I
realise this is an RFC, but it can certainly make things easier to
review if you know what to expect before reading the function.

> +{
> +	struct vring_used *used = !qid ? &vring_used_0.used : &vring_used_1.used;

Using an array as suggested above would avoid this awkward construct.

> +	uint32_t i;
> +	uint16_t used_idx, last_used;

passt style orders declarations in reverse order of line length (when
possible), so this line should be before the previous one.

> +
> +	/* TODO think if this has races with previous eventfd_read */
> +	/* TODO we could improve performance with a shadow_used_idx */
> +	used_idx = le16toh(used->idx);
> +
> +	smp_rmb();
> +
> +	if (used_idx == vqs[0].last_used_idx) {
> +		*len = 0;
> +		return NULL;
> +	}
> +
> +	last_used = vqs[0].last_used_idx % VHOST_NDESCS;
> +	i = le32toh(used->ring[last_used].id);
> +	*len = le32toh(used->ring[last_used].len);
> +
> +	/* Make sure the kernel is consuming the descriptors in order */
> +	if (i != last_used) {
> +		die("vhost: id %u at used position %u != %u", i, last_used, i);
> +		return NULL;
> +	}
> +
> +	if (*len > PKT_BUF_BYTES/VHOST_NDESCS) {

You want your L1_MAX_LEN_VHOST (or whatever it's called) again here.

> +		warn("vhost: id %d len %u > %zu", i, *len, PKT_BUF_BYTES/VHOST_NDESCS);

And here.

> +		return NULL;
> +	}
> +
> +	/* TODO check if the id is valid and it has not been double used */
> +	vqs[0].last_used_idx++;
> +	vqs[0].num_free++;
> +	return pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);

And here.

> +}
> +
> +/* TODO this assumes the kernel consumes descriptors in order */
> +static void rx_pkt_refill(struct ctx *c)
> +{
> +	/* TODO: tune this threshold */
> +	if (!vqs[0].num_free)
> +		return;
> +
> +	vring_avail_0.avail.idx += vqs[0].num_free;
> +	vqs[0].num_free = 0;
> +	vhost_kick(&vring_used_0.used, c->vq[0].kick_fd);
> +}
> +
> +void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct timespec *now)
> +{
> +	eventfd_read(ref.fd, (eventfd_t[]){ 0 });
> +
> +	tap_flush_pools();
> +
> +	while (true) {
> +		struct virtio_net_hdr_mrg_rxbuf *hdr;
> +		unsigned len;
> +
> +		hdr = virtqueue_get_rx_buf(ref.queue, &len);
> +		if (!hdr)
> +			break;

You could use while ((hdr = virtqueue_get_rx_buf(...))) instead of
while (true), couldn't you?

> +
> +		if (len < sizeof(*hdr)) {

You should check that there are enough bytes for the L2 header as well
as the mrg_rxbuf header - tap_add_packet() doesn't appear to check
that itself (maybe it should).
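
i.e., hypothetically (ETH_HLEN from <net/ethernet.h>):

  		if (len < sizeof(*hdr) + ETH_HLEN) {
  			warn("vhost: invalid len %u", len);
  			continue;
  		}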

> +			warn("vhost: invalid len %u", len);
> +			continue;
> +		}
> +
> +		/* TODO this will break from this moment */
> +		if (hdr->num_buffers != 1) {
> +			warn("vhost: Too many buffers %u, %zu bytes should be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS);

L1_MAX_LEN_VHOST again.  Also passt style keeps lines below 80
columns.

> +			continue;
> +		}
> +
> +		/* TODO fix the v6 pool to support ipv6 */

Not clear on why anything particular is needed for ipv6 here.

> +		tap_add_packet(c, len - sizeof(*hdr), (void *)(hdr+1), now);
> +	}
> +
> +	tap_handler(c, now);
> +	rx_pkt_refill(c);
> +}
> +
>  /**
>   * tap_ns_tun() - Get tuntap fd in namespace
>   * @c:		Execution context
> @@ -1396,10 +1537,14 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>   */
>  static int tap_ns_tun(void *arg)
>  {
> +	/* TODO we need to check if vhost support VIRTIO_NET_F_MRG_RXBUF and VHOST_NET_F_VIRTIO_NET_HDR actually */
> +	static const uint64_t features =
> +		(1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_NET_F_MRG_RXBUF) | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR);

If I understand the code above properly, for the Rx path at least we
don't need or want F_MRG_RXBUF.  So why require it here?  If we want
it for the Tx path, we can omit it here and add it later in the
series.

>  	struct ifreq ifr = { .ifr_flags = IFF_TAP | IFF_NO_PI };
>  	int flags = O_RDWR | O_NONBLOCK | O_CLOEXEC;
>  	struct ctx *c = (struct ctx *)arg;
> -	int fd, rc;
> +	unsigned i;
> +	int fd, vhost_fd, rc;
>  
>  	c->fd_tap = -1;
>  	memcpy(ifr.ifr_name, c->pasta_ifn, IFNAMSIZ);
> @@ -1409,6 +1554,143 @@ static int tap_ns_tun(void *arg)
>  	if (fd < 0)
>  		die_perror("Failed to open() /dev/net/tun");
>  
> +	vhost_fd = open("/dev/vhost-net", flags);
> +	if (vhost_fd < 0)
> +		die_perror("Failed to open() /dev/vhost-net");
> +
> +	rc = ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
> +	if (rc < 0)
> +		die_perror("VHOST_SET_OWNER ioctl on /dev/vhost-net failed");
> +
> +	rc = ioctl(vhost_fd, VHOST_GET_FEATURES, &c->virtio_features);
> +	if (rc < 0)
> +		die_perror("VHOST_GET_FEATURES ioctl on /dev/vhost-net failed");
> +
> +	/* TODO inform more explicitely */
> +	fprintf(stderr, "vhost features: %lx\n", c->virtio_features);
> +	fprintf(stderr, "req features: %lx\n", features);

You should use info() or debug() here.

> +	c->virtio_features &= features;
> +	if (c->virtio_features != features)
> +		die("vhost does not support required features");
> +
> +	for (i = 0; i < ARRAY_SIZE(c->vq); i++) {
> +		struct vhost_vring_file file = {
> +			.index = i,
> +		};
> +		union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_CALL,
> +					.queue = i };
> +		struct epoll_event ev;
> +
> +		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> +		ref.fd = file.fd;
> +		if (file.fd < 0)
> +			die_perror("Failed to create call eventfd");

I know it's technically harmless, but it reads better for me if you
check for the error immediately after the eventfd() call, before doing
something else with the value.

> +
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_CALL, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_CALL ioctl on /dev/vhost-net failed");
> +
> +		ev = (struct epoll_event){ .data.u64 = ref.u64, .events = EPOLLIN };
> +		if (i == 0) {
> +			rc = epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
> +			if (rc < 0)
> +				die_perror("Failed to add call eventfd to epoll");
> +		}
> +		c->vq[i].call_fd = file.fd;
> +	}
> +
> +	/*
> +	 * Unlike most of C-style APIs, userspace_addr+memory_size is
> +	 * also accesible by the kernel.  Include a -1 to adjust.
> +	 */
> +#define VHOST_MEMORY_REGION_PTR(addr, size) \
> +	(struct vhost_memory_region) { \
> +		.guest_phys_addr = (uintptr_t)addr, \
> +		.memory_size = size - 1, \
> +		.userspace_addr = (uintptr_t)addr, \
> +	}
> +#define VHOST_MEMORY_REGION(elem) VHOST_MEMORY_REGION_PTR(&elem, sizeof(elem))
> +
> +	/* 1:1 translation */
> +	vhost_memory.mem.regions[0] = VHOST_MEMORY_REGION(vring_desc);
> +	vhost_memory.mem.regions[1] = VHOST_MEMORY_REGION(vring_avail_0);
> +	vhost_memory.mem.regions[2] = VHOST_MEMORY_REGION(vring_avail_1);
> +	vhost_memory.mem.regions[3] = VHOST_MEMORY_REGION(vring_used_0);
> +	vhost_memory.mem.regions[4] = VHOST_MEMORY_REGION(vring_used_1);
> +	vhost_memory.mem.regions[5] = VHOST_MEMORY_REGION(pkt_buf);
> +	static_assert(5 < N_VHOST_REGIONS);
> +#undef VHOST_MEMORY_REGION
> +#undef VHOST_MEMORY_REGION_PTR
> +
> +	rc = ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &vhost_memory.mem);
> +	if (rc < 0)
> +		die_perror(
> +			"VHOST_SET_MEM_TABLE ioctl on /dev/vhost-net failed");
> +
> +	/* TODO: probably it increases RX perf */
> +#if 0
> +	struct ifreq ifr;
> +	memset(&ifr, 0, sizeof(ifr));
> +
> +	if (ioctl(fd, TUNGETIFF, &ifr) != 0)
> +		die_perror("Unable to query TUNGETIFF on FD %d", fd);
> +	}
> +
> +	if (ifr.ifr_flags & IFF_VNET_HDR)
> +		net->dev.features &= ~(1ULL << VIRTIO_NET_F_MRG_RXBUF);
> +#endif
> +	rc = ioctl(vhost_fd, VHOST_SET_FEATURES, &c->virtio_features);
> +	if (rc < 0)
> +		die_perror("VHOST_SET_FEATURES ioctl on /dev/vhost-net failed");
> +
> +	/* Duplicating foreach queue to follow the exact order from QEMU */
> +	for (i = 0; i < ARRAY_SIZE(c->vq); i++) {
> +		struct vhost_vring_addr addr = {
> +			.index = i,
> +			.desc_user_addr = (unsigned long)vring_desc[i],
> +			.avail_user_addr = i == 0 ? (unsigned long)&vring_avail_0 :
> +										(unsigned long)&vring_avail_1,
> +			.used_user_addr = i == 0 ? (unsigned long)&vring_used_0 :
> +										(unsigned long)&vring_used_1,
> +			/* GPA addr */
> +			.log_guest_addr = i == 0 ? (unsigned long)&vring_used_0 :
> +									   (unsigned long)&vring_used_1,
> +		};
> +		struct vhost_vring_state state = {
> +			.index = i,
> +			.num = VHOST_NDESCS,
> +		};
> +		struct vhost_vring_file file = {
> +			.index = i,
> +		};
> +
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_NUM ioctl on /dev/vhost-net failed");
> +
> +		fprintf(stderr, "qid: %d\n", i);
> +		fprintf(stderr, "vhost desc addr: 0x%llx\n", addr.desc_user_addr);
> +		fprintf(stderr, "vhost avail addr: 0x%llx\n", addr.avail_user_addr);
> +		fprintf(stderr, "vhost used addr: 0x%llx\n", addr.used_user_addr);
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_ADDR ioctl on /dev/vhost-net failed");
> +
> +		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> +		if (file.fd < 0)
> +			die_perror("Failed to create kick eventfd");
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_KICK, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_KICK ioctl on /dev/vhost-net failed");
> +		c->vq[i].kick_fd = file.fd;
> +
> +		vqs[i].num_free = VHOST_NDESCS;
> +	}
> +
>  	rc = ioctl(fd, (int)TUNSETIFF, &ifr);
>  	if (rc < 0)
>  		die_perror("TUNSETIFF ioctl on /dev/net/tun failed");
> @@ -1416,7 +1698,34 @@ static int tap_ns_tun(void *arg)
>  	if (!(c->pasta_ifi = if_nametoindex(c->pasta_ifn)))
>  		die("Tap device opened but no network interface found");
>  
> +	for (i = 0; i < VHOST_NDESCS; ++i) {
> +		vring_desc[0][i].addr = (uintptr_t)pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);
> +		vring_desc[0][i].len = PKT_BUF_BYTES/VHOST_NDESCS;
> +		vring_desc[0][i].flags = VRING_DESC_F_WRITE;
> +	}
> +	for (i = 0; i < VHOST_NDESCS; ++i) {

No { } for single line blocks in passt style.

> +		vring_avail_0.avail.ring[i] = i;
> +	}
> +
> +	rx_pkt_refill(c);
> +
> +	/* Duplicating foreach queue to follow the exact order from QEMU */
> +	for (int i = 0; i < ARRAY_SIZE(c->vq); i++) {
> +		struct vhost_vring_file file = {
> +			.index = i,
> +			.fd = fd,
> +		};
> +
> +		fprintf(stderr, "qid: %d\n", file.index);
> +		fprintf(stderr, "tap fd: %d\n", file.fd);
> +		rc = ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_NET_SET_BACKEND ioctl on /dev/vhost-net failed");
> +	}
> +
>  	c->fd_tap = fd;
> +	c->fd_vhost = vhost_fd;
>  
>  	return 0;
>  }
> diff --git a/tap.h b/tap.h
> index 6fe3d15..ff8cee5 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -69,6 +69,14 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
>  		thdr->vnet_len = htonl(l2len);
>  }
>  
> +/**
> + * tap_vhost_input() - Handler for new data on the socket to hypervisor vq
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @now:	Current timestamp
> + */

passt convention puts these comments in the .c file not the .h, even
for exported functions.

> +void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct timespec *now);
> +
>  unsigned long tap_l2_max_len(const struct ctx *c);
>  void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
>  void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 02/11] tap: add die() on vhost error
  2025-07-09 17:47 ` [RFC v2 02/11] tap: add die() on vhost error Eugenio Pérez
@ 2025-07-23  6:58   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-23  6:58 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 2953 bytes --]

On Wed, Jul 09, 2025 at 07:47:39PM +0200, Eugenio Pérez wrote:
> In case the kernel needs to signal an error.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> RFC v2: Add TODO (David)
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  epoll_type.h |  2 ++
>  passt.c      |  5 +++++
>  passt.h      |  1 +
>  tap.c        | 16 ++++++++++++++++
>  4 files changed, 24 insertions(+)
> 
> diff --git a/epoll_type.h b/epoll_type.h
> index 0371c14..93c4701 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -46,6 +46,8 @@ enum epoll_type {
>  	EPOLL_TYPE_REPAIR,
>  	/* vhost-kernel call socket */
>  	EPOLL_TYPE_VHOST_CALL,
> +	/* vhost-kernel error socket */
> +	EPOLL_TYPE_VHOST_ERROR,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/passt.c b/passt.c
> index 0f2659c..d839f5a 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -80,6 +80,7 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
>  	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
>  	[EPOLL_TYPE_VHOST_CALL]		= "vhost-kernel call socket",
> +	[EPOLL_TYPE_VHOST_ERROR]	= "vhost-kernel error socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -362,6 +363,10 @@ loop:
>  		case EPOLL_TYPE_VHOST_CALL:
>  			tap_vhost_input(&c, ref, &now);
>  			break;
> +		case EPOLL_TYPE_VHOST_ERROR:
> +			/* TODO re-initialize vhost */
> +			die("Error on vhost-kernel socket");
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 7bb86c4..0066145 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -303,6 +303,7 @@ struct ctx {
>  	struct {
>  		int kick_fd;
>  		int call_fd;
> +		int err_fd;
>  	} vq[2];
>  
>  	int no_dns;
> diff --git a/tap.c b/tap.c
> index e4a3822..0c49e6d 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -1598,6 +1598,22 @@ static int tap_ns_tun(void *arg)
>  				die_perror("Failed to add call eventfd to epoll");
>  		}
>  		c->vq[i].call_fd = file.fd;
> +
> +		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> +		if (file.fd < 0)
> +			die_perror("Failed to create error eventfd");
> +
> +		rc = ioctl(vhost_fd, VHOST_SET_VRING_ERR, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_ERR ioctl on /dev/vhost-net failed");
> +
> +		ref.type = EPOLL_TYPE_VHOST_ERROR, ref.fd = file.fd;
> +		ev.data.u64 = ref.u64;
> +		rc = epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
> +		if (rc < 0)
> +			die_perror("Failed to add error eventfd to epoll");
> +		c->vq[i].err_fd = file.fd;
>  	}
>  
>  	/*

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 04/11] tcp: export memory regions to vhost
  2025-07-09 17:47 ` [RFC v2 04/11] tcp: export memory regions to vhost Eugenio Pérez
@ 2025-07-23  7:06   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-23  7:06 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 5357 bytes --]

On Wed, Jul 09, 2025 at 07:47:41PM +0200, Eugenio Pérez wrote:
> So vhost kernel is able to access the TCP buffers.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  tap.c     | 14 +++++++++++---
>  tcp_buf.c | 14 ++++----------
>  tcp_buf.h | 19 +++++++++++++++++++
>  3 files changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 0656294..8b3ec45 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -63,6 +63,8 @@
>  #include "vhost_user.h"
>  #include "vu_common.h"
> 
> +#include "tcp_buf.h"
> +

I don't love including the pretty specific content of tcp_buf.h into
the mostly protocol-unaware tap.c.  Though I do realise that avoiding
it will probably have other tradeoffs.

>  /* Maximum allowed frame lengths (including L2 header) */
> 
>  /* Verify that an L2 frame length limit is large enough to contain the header,
> @@ -136,8 +138,8 @@ static union {
>  	char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
>  } vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));
> 
> -/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */
> -#define N_VHOST_REGIONS 6
> +/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf + TCP virtio hdr + TCP eth(src,dst) + TCP ip hdr */
> +#define N_VHOST_REGIONS 12

Hmm.  Keeping this and the region initialisation in sync is pretty
clunky.  I recall that before I went on leave we were discussing just
exposing pasta's whole data segment to vhost; was there a reason for
changing that plan, or just that you haven't implemented it so far?
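
If we do go back to that, a single region could cover everything; a
rough, untested sketch (assuming vhost-net accepts one big region, and
using the usual glibc/Linux linker symbols __data_start and end):

	extern char __data_start[], end[];

	/* One region spanning pasta's whole data segment (data + BSS) */
	struct vhost_memory_region region = {
		.guest_phys_addr = (uintptr_t)__data_start,
		.memory_size = (uintptr_t)end - (uintptr_t)__data_start,
		.userspace_addr = (uintptr_t)__data_start,
	};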

>  union {
>  	struct vhost_memory mem;
>  	char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])];
> @@ -1635,7 +1637,13 @@ static int tap_ns_tun(void *arg)
>  	vhost_memory.mem.regions[3] = VHOST_MEMORY_REGION(vring_used_0);
>  	vhost_memory.mem.regions[4] = VHOST_MEMORY_REGION(vring_used_1);
>  	vhost_memory.mem.regions[5] = VHOST_MEMORY_REGION(pkt_buf);
> -	static_assert(5 < N_VHOST_REGIONS);
> +	vhost_memory.mem.regions[6] = VHOST_MEMORY_REGION(tcp_payload_tap_hdr);
> +	vhost_memory.mem.regions[7] = VHOST_MEMORY_REGION(tcp4_eth_src);
> +	vhost_memory.mem.regions[8] = VHOST_MEMORY_REGION(tcp6_eth_src);
> +	vhost_memory.mem.regions[9] = VHOST_MEMORY_REGION(tcp4_payload_ip);
> +	vhost_memory.mem.regions[10] = VHOST_MEMORY_REGION(tcp6_payload_ip);
> +	vhost_memory.mem.regions[11] = VHOST_MEMORY_REGION(tcp_payload);
> +	static_assert(11 < N_VHOST_REGIONS);

If all the regions are global variables, you could put them into a
static const array, then define N_VHOST_REGIONS via ARRAY_SIZE().
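Untested sketch of what I mean, assuming VHOST_MEMORY_REGION() expands
to a plain struct vhost_memory_region initialiser (addresses of globals
are address constants, so a static initialiser should work):

	static const struct vhost_memory_region vhost_regions[] = {
		VHOST_MEMORY_REGION(vring_desc),
		VHOST_MEMORY_REGION(vring_avail_0),
		VHOST_MEMORY_REGION(vring_avail_1),
		VHOST_MEMORY_REGION(vring_used_0),
		VHOST_MEMORY_REGION(vring_used_1),
		VHOST_MEMORY_REGION(pkt_buf),
		/* ...TCP header and payload buffers... */
	};
	#define N_VHOST_REGIONS ARRAY_SIZE(vhost_regions)

	/* then, at init time: */
	vhost_memory.mem.nregions = N_VHOST_REGIONS;
	memcpy(vhost_memory.mem.regions, vhost_regions,
	       sizeof(vhost_regions));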

>  #undef VHOST_MEMORY_REGION
>  #undef VHOST_MEMORY_REGION_PTR
> 
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 2fbd056..c999d2e 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -22,8 +22,6 @@
> 
>  #include <netinet/tcp.h>
> 
> -#include <linux/virtio_net.h>
> -
>  #include "util.h"
>  #include "ip.h"
>  #include "iov.h"
> @@ -35,24 +33,20 @@
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
> 
> -#define TCP_FRAMES_MEM			128
> -#define TCP_FRAMES							   \
> -	(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
> -
>  /* Static buffers */
> 
>  /* Ethernet header for IPv4 and IPv6 frames */
> -static struct ethhdr		tcp4_eth_src;
> -static struct ethhdr		tcp6_eth_src;
> +struct ethhdr		tcp4_eth_src;
> +struct ethhdr		tcp6_eth_src;
> 
> -static struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
> +struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
> 
>  /* IP headers for IPv4 and IPv6 */
>  struct iphdr		tcp4_payload_ip[TCP_FRAMES_MEM];
>  struct ipv6hdr		tcp6_payload_ip[TCP_FRAMES_MEM];
> 
>  /* TCP segments with payload for IPv4 and IPv6 frames */
> -static struct tcp_payload_t	tcp_payload[TCP_FRAMES_MEM];
> +struct tcp_payload_t	tcp_payload[TCP_FRAMES_MEM];
> 
>  static_assert(MSS4 <= sizeof(tcp_payload[0].data), "MSS4 is greater than 65516");
>  static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516");
> diff --git a/tcp_buf.h b/tcp_buf.h
> index 54f5e53..7ae2536 100644
> --- a/tcp_buf.h
> +++ b/tcp_buf.h
> @@ -6,9 +6,28 @@
>  #ifndef TCP_BUF_H
>  #define TCP_BUF_H
> 
> +#include <linux/virtio_net.h>
> +
> +#include "tcp_conn.h"
> +#include "tcp_internal.h"
> +
>  void tcp_sock_iov_init(const struct ctx *c);
>  void tcp_payload_flush(const struct ctx *c);
> +struct tcp_tap_conn;
>  int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
>  int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
> 
> +#define TCP_FRAMES_MEM			128
> +#define TCP_FRAMES							   \
> +(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
> +
> +extern struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
> +extern struct tcp_payload_t	tcp_payload[TCP_FRAMES_MEM];
> +
> +extern struct ethhdr		tcp4_eth_src;
> +extern struct ethhdr		tcp6_eth_src;
> +
> +extern struct iphdr		tcp4_payload_ip[TCP_FRAMES_MEM];
> +extern struct ipv6hdr		tcp6_payload_ip[TCP_FRAMES_MEM];
> +
>  #endif  /*TCP_BUF_H */

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 05/11] virtio: Fill .next in tx queue
  2025-07-09 17:47 ` [RFC v2 05/11] virtio: Fill .next in tx queue Eugenio Pérez
@ 2025-07-23  7:07   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-23  7:07 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 1013 bytes --]

On Wed, Jul 09, 2025 at 07:47:42PM +0200, Eugenio Pérez wrote:
> This way we can send one frame splitted in many buffers. TCP stack

English usage note: s/splitted/split/.

> already works this way in pasta.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  tap.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tap.c b/tap.c
> index 8b3ec45..5667fbe 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -1727,6 +1727,9 @@ static int tap_ns_tun(void *arg)
>  		vring_desc[0][i].len = PKT_BUF_BYTES/VHOST_NDESCS;
>  		vring_desc[0][i].flags = VRING_DESC_F_WRITE;
>  	}
> +	for (i = 0; i < (VHOST_NDESCS - 1); ++i) {
> +		vring_desc[1][i].next = i+1;
> +	}

passt style omits { }  here.

>  	for (i = 0; i < VHOST_NDESCS; ++i) {
>  		vring_avail_0.avail.ring[i] = i;
>  	}

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock
  2025-07-09 17:47 ` [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock Eugenio Pérez
@ 2025-07-23  7:09   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-23  7:09 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 1524 bytes --]

On Wed, Jul 09, 2025 at 07:47:43PM +0200, Eugenio Pérez wrote:
> As it is the only function using it. I'm always confusing it with
> tcp_l2_iov, moving it here avoids it.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

I like making it local; I'd question whether it even needs to remain 'static'.

> ---
>  tcp_buf.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/tcp_buf.c b/tcp_buf.c
> index c999d2e..6d79d67 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -55,9 +55,6 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
>  static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
>  static unsigned int tcp_payload_used;
>  
> -/* recvmsg()/sendmsg() data for tap */
> -static struct iovec	iov_sock		[TCP_FRAMES_MEM + 1];
> -
>  static struct iovec	tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
>  
>  /**
> @@ -292,6 +289,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>   */
>  int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
> +	static struct iovec iov_sock[TCP_FRAMES_MEM + 1];
> +
>  	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
>  	int len, dlen, i, s = conn->sock;

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf
  2025-07-09 17:47 ` [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf Eugenio Pérez
@ 2025-07-24  0:17   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-24  0:17 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 3834 bytes --]

On Wed, Jul 09, 2025 at 07:47:40PM +0200, Eugenio Pérez wrote:
> vhost kernel expects this as the first data of the frame.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  tap.c     |  2 +-
>  tcp_buf.c | 15 +++++++++++++--
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 0c49e6d..0656294 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -593,7 +593,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  		      nframes - m, nframes);
> 
>  	pcap_multiple(iov, bufs_per_frame, m,
> -		      c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
> +		      c->mode == MODE_PASST ? sizeof(uint32_t) : sizeof(struct virtio_net_hdr_mrg_rxbuf));

If I understand correctly, you're always considering
virtio_net_hdr_mrg_rxbuf part of the extended frame in PASTA mode;
it's just unused when you're not using the vhost path.  Is that correct?

We probably want a helper that returns the correct tap header length,
a companion to tap_hdr_iov().
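
Sketch of what I have in mind (assumptions: pasta always reserves the
virtio-net header once this series is in, whether or not the vhost path
is actually taken, and MODE_VU doesn't go through this path):

	/* tap_hdr_len() - Length of the "below L2" header for current mode */
	static inline size_t tap_hdr_len(const struct ctx *c)
	{
		if (c->mode == MODE_PASST)
			return sizeof(uint32_t);	/* qemu socket vnet_len */

		return sizeof(struct virtio_net_hdr_mrg_rxbuf);
	}

Then pcap_multiple() and the iov setup could share it.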

> 
>  	return m;
>  }
> diff --git a/tcp_buf.c b/tcp_buf.c
> index d1fca67..2fbd056 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -22,6 +22,8 @@
> 
>  #include <netinet/tcp.h>
> 
> +#include <linux/virtio_net.h>
> +
>  #include "util.h"
>  #include "ip.h"
>  #include "iov.h"
> @@ -43,7 +45,7 @@
>  static struct ethhdr		tcp4_eth_src;
>  static struct ethhdr		tcp6_eth_src;
> 
> -static struct tap_hdr		tcp_payload_tap_hdr[TCP_FRAMES_MEM];
> +static struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];

The intention is that struct tap_hdr can store whatever sort of "below
L2" header we need - it's just that so far we only had the trivial
qemu socket header or nothing.  Now we're adding another option, so it
should be a union branch of tap_hdr, rather than a separate structure.
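
Roughly like this (sketch only; the member names here are made up):

	struct tap_hdr {
		union {
			/* MODE_PASST: qemu socket length header */
			uint32_t vnet_len;
			/* MODE_PASTA with vhost-net */
			struct virtio_net_hdr_mrg_rxbuf vnet_hdr;
		};
	};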

>  /* IP headers for IPv4 and IPv6 */
>  struct iphdr		tcp4_payload_ip[TCP_FRAMES_MEM];
> @@ -75,6 +77,14 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
>  	eth_update_mac(&tcp6_eth_src, eth_d, eth_s);
>  }
> 
> +static inline struct iovec virtio_net_hdr_iov(struct virtio_net_hdr_mrg_rxbuf *hdr)
> +{
> +        return (struct iovec){
> +                .iov_base = hdr,
> +                .iov_len = sizeof(*hdr),
> +        };
> +}

With that unification of tap_hdr_iov(), this should be a branch in
tap_hdr_iov(), rather than a separate function.  tap_hdr_update() will
also need to be updated to only update vnet_len in PASST (qemu socket)
mode.

It is now occurring to me that adding vhost is highlighting an
existing ambiguity in our terminology: "tap" can mean either "the
guest facing interface, of whatever mechanics", or specifically the
kernel tuntap device.  Maybe we need to look at a great renaming.

> +
>  /**
>   * tcp_sock_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
>   * @c:		Execution context
> @@ -85,6 +95,7 @@ void tcp_sock_iov_init(const struct ctx *c)
>  	struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
>  	int i;
> 
> +	(void)c;
>  	tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
>  	tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
> 
> @@ -96,7 +107,7 @@ void tcp_sock_iov_init(const struct ctx *c)
>  	for (i = 0; i < TCP_FRAMES_MEM; i++) {
>  		struct iovec *iov = tcp_l2_iov[i];
> 
> -		iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp_payload_tap_hdr[i]);
> +		iov[TCP_IOV_TAP] = virtio_net_hdr_iov(&tcp_payload_tap_hdr[i]);
>  		iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
>  		iov[TCP_IOV_PAYLOAD].iov_base = &tcp_payload[i];
>  	}

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 07/11] tap: support tx through vhost
  2025-07-09 17:47 ` [RFC v2 07/11] tap: support tx through vhost Eugenio Pérez
@ 2025-07-24  0:24   ` David Gibson
  2025-07-24 14:30     ` Stefano Brivio
  0 siblings, 1 reply; 26+ messages in thread
From: David Gibson @ 2025-07-24  0:24 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 12361 bytes --]

On Wed, Jul 09, 2025 at 07:47:44PM +0200, Eugenio Pérez wrote:
> No users enable vhost right now; this patch just defines the functions.
> 
> The use of the virtqueue is similar to the rx case: pasta fills the descriptor
> table with packet data it wants to send to the namespace.  Each
> descriptor points to a buffer in memory, with an address and a length.
> The number of descriptors is again defined by VHOST_NDESCS.

Does the number of descriptors have to be equal for the Tx and Rx
queues?  For Rx, the number of descriptors is basically determined by
how many frames we can fit in pkt_buf.  That doesn't really apply to
the Tx path though; it would be more natural to define the number of
Tx descriptors independently.
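
That is, something like (sketch; the tx value is an arbitrary example,
not a measured optimum):

	/* rx: one descriptor per 64KiB frame that fits in pkt_buf */
	#define VHOST_RX_NDESCS	(PKT_BUF_BYTES / 65520)
	/* tx: independent of pkt_buf, sized for frames in flight instead */
	#define VHOST_TX_NDESCS	256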

> Afterwards it writes the descriptor index into the avail->ring[] array,
> then increments avail->idx to make it visible to the kernel, then kicks
> the virtqueue 1 event fd.
> 
> When the kernel does not need the buffer anymore it writes its id into
> the used_ring->ring[], and increments used_ring->idx.  Normally, the
> kernel also notifies pasta through the call eventfd of virtqueue 1.
> But we don't monitor the eventfd.  Instead, we check if we can reuse the
> buffers or not just when we produce, making the code simpler and more
> performant.

Oh, that's pretty neat.

Nit: "performant" isn't standard English; "more efficient" would read better.

> Like on the rx path, we assume descriptors are used in the same order
> they were made available. This is also consistent with behavior seen in
> QEMU's virtio-net implementation.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  arp.c     |  2 +-
>  tap.c     | 84 +++++++++++++++++++++++++++++++++++++++++++++++--------
>  tap.h     |  4 +--
>  tcp.c     |  2 +-
>  tcp_buf.c |  2 +-
>  udp.c     |  2 +-
>  6 files changed, 79 insertions(+), 17 deletions(-)
> 
> diff --git a/arp.c b/arp.c
> index fc482bb..ea786a0 100644
> --- a/arp.c
> +++ b/arp.c
> @@ -80,7 +80,7 @@ int arp(const struct ctx *c, const struct pool *p)
>  	memcpy(eh->h_dest,	eh->h_source,	sizeof(eh->h_dest));
>  	memcpy(eh->h_source,	c->our_tap_mac,	sizeof(eh->h_source));
>  
> -	tap_send_single(c, eh, l2len);
> +	tap_send_single(c, eh, l2len, false);
>  
>  	return 1;
>  }
> diff --git a/tap.c b/tap.c
> index 5667fbe..7ccac86 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -121,11 +121,19 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
>  static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
>  			 "Number of vhost descs must be a power of two by standard");
>  static struct {
> +	/* Descriptor index we're using. This is not the same as avail idx in
> +	 * split: this takes into account the chained descs */
> +	uint16_t vring_idx;
> +
>  	/* Number of free descriptors */
>  	uint16_t num_free;
>  
>  	/* Last used idx processed */
>  	uint16_t last_used_idx;
> +
> +	/* Descriptors in use */
> +	/* desc info: number of descriptors in the chain */
> +	uint16_t ndescs[VHOST_NDESCS];
>  } vqs[2];

Do we need all these fields for both vqs?  Or should we have different
structs or a union to store different info for the Rx versus Tx queue.
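
For instance, separate structs along these lines (sketch, based only on
the fields visible in this series, so I may be missing uses elsewhere):

	static struct {
		uint16_t num_free;	/* free descriptors */
		uint16_t last_used_idx;	/* last used idx processed */
	} vq_state[2];

	static struct {
		uint16_t vring_idx;	/* next descriptor entry to fill */
		uint16_t ndescs[VHOST_NDESCS];	/* chain length per head */
	} vq_tx_state;	/* tx queue (index 1) only */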

>  
>  static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
> @@ -176,7 +184,7 @@ unsigned long tap_l2_max_len(const struct ctx *c)
>   * @data:	Packet buffer
>   * @l2len:	Total L2 packet length
>   */
> -void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
> +void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost)

This function is explicitly a slow path, so I don't think we need to
offer the option of using vhost.  Likewise with the various wrappers below.

>  {
>  	uint32_t vnet_len = htonl(l2len);
>  	struct iovec iov[2];
> @@ -192,7 +200,7 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
>  		iov[iovcnt].iov_len = l2len;
>  		iovcnt++;
>  
> -		tap_send_frames(c, iov, iovcnt, 1);
> +		tap_send_frames(c, iov, iovcnt, 1, vhost);
>  		break;
>  	case MODE_VU:
>  		vu_send_single(c, data, l2len);
> @@ -314,7 +322,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
>  	char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
>  
>  	memcpy(data, in, dlen);
> -	tap_send_single(c, buf, dlen + (data - buf));
> +	tap_send_single(c, buf, dlen + (data - buf), false);
>  }
>  
>  /**
> @@ -336,7 +344,7 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
>  	memcpy(icmp4h, in, l4len);
>  	csum_icmp4(icmp4h, icmp4h + 1, l4len - sizeof(*icmp4h));
>  
> -	tap_send_single(c, buf, l4len + ((char *)icmp4h - buf));
> +	tap_send_single(c, buf, l4len + ((char *)icmp4h - buf), false);
>  }
>  
>  /**
> @@ -421,7 +429,7 @@ void tap_udp6_send(const struct ctx *c,
>  	char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
>  
>  	memcpy(data, in, dlen);
> -	tap_send_single(c, buf, dlen + (data - buf));
> +	tap_send_single(c, buf, dlen + (data - buf), false);
>  }
>  
>  /**
> @@ -444,7 +452,7 @@ void tap_icmp6_send(const struct ctx *c,
>  	memcpy(icmp6h, in, l4len);
>  	csum_icmp6(icmp6h, src, dst, icmp6h + 1, l4len - sizeof(*icmp6h));
>  
> -	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
> +	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf), false);
>  }
>  
>  static void vhost_kick(struct vring_used *used, int kick_fd) {
> @@ -459,8 +467,9 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
>  		eventfd_write(kick_fd, 1);
>  }
>  
> +
>  /**
> - * tap_send_frames_pasta() - Send multiple frames to the pasta tap
> + * tap_send_frames_vhost() - Send multiple frames to the pasta tap
>   * @c:			Execution context
>   * @iov:		Array of buffers
>   * @bufs_per_frame:	Number of buffers (iovec entries) per frame
> @@ -470,16 +479,68 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
>   * @bufs_per_frame contiguous buffers representing a single frame.
>   *
>   * Return: number of frames successfully sent
> + */
> +static size_t tap_send_frames_vhost(const struct ctx *c,
> +				    const struct iovec *iov,
> +				    size_t bufs_per_frame, size_t nframes)
> +{
> +	size_t i;
> +
> +	for (i = 0; i < nframes; i++) {
> +		size_t j;
> +
> +		if (vqs[1].num_free < bufs_per_frame)
> +			return i;
> +
> +		vring_avail_1.avail.ring[(vring_avail_1.avail.idx + i) % VHOST_NDESCS] = htole16(vqs[1].vring_idx) % VHOST_NDESCS;

Seems like it would be worth putting ((vring_avail_1.avail.idx + i) %
VHOST_NDESCS) into a local.
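
Say (sketch; note that the modulo should probably be applied before
htole16(), not after, or this breaks on big-endian):

		uint16_t avail_i = (le16toh(vring_avail_1.avail.idx) + i) % VHOST_NDESCS;

		vring_avail_1.avail.ring[avail_i] = htole16(vqs[1].vring_idx % VHOST_NDESCS);
		vqs[1].ndescs[avail_i] = bufs_per_frame;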

> +		vqs[1].ndescs[(vring_avail_1.avail.idx + i) % VHOST_NDESCS] = bufs_per_frame;
> +		vqs[1].num_free -= bufs_per_frame;
> +
> +		for (j = 0; j < bufs_per_frame; ++j) {
> +			struct vring_desc *desc = &vring_desc[1][vqs[1].vring_idx % VHOST_NDESCS];
> +			const struct iovec *iov_i = &iov[i * bufs_per_frame + j];
> +
> +			desc->addr = (uint64_t)iov_i->iov_base;
> +			desc->len = iov_i->iov_len;
> +			desc->flags = (j == bufs_per_frame - 1) ? 0 : htole16(VRING_DESC_F_NEXT);
> +			vqs[1].vring_idx++;
> +		}
> +	}
> +
> +	smp_wmb();
> +	vring_avail_1.avail.idx = htole16(le16toh(vring_avail_1.avail.idx) + nframes);
> +
> +	vhost_kick(&vring_used_1.used, c->vq[1].kick_fd);
> +
> +	return nframes;
> +}
> +
> +
> +/**
> + * tap_send_frames_pasta() - Send multiple frames to the pasta tap
> + * @c:			Execution context
> + * @iov:		Array of buffers
> + * @bufs_per_frame:	Number of buffers (iovec entries) per frame
> + * @nframes:		Number of frames to send
> + * @vhost:             Use vhost-kernel or not
> + *
> + * @iov must have total length @bufs_per_frame * @nframes, with each set of
> + * @bufs_per_frame contiguous buffers representing a single frame.
> + *
> + * Return: number of frames successfully sent (or queued)
>   *
>   * #syscalls:pasta write
>   */
>  static size_t tap_send_frames_pasta(const struct ctx *c,
>  				    const struct iovec *iov,
> -				    size_t bufs_per_frame, size_t nframes)
> +				    size_t bufs_per_frame, size_t nframes, bool vhost)
>  {
>  	size_t nbufs = bufs_per_frame * nframes;
>  	size_t i;
>  
> +	if (vhost)
> +		return tap_send_frames_vhost(c, iov, bufs_per_frame, nframes);
> +
>  	for (i = 0; i < nbufs; i += bufs_per_frame) {
>  		ssize_t rc = writev(c->fd_tap, iov + i, bufs_per_frame);
>  		size_t framelen = iov_size(iov + i, bufs_per_frame);
> @@ -563,14 +624,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
>   * @iov:		Array of buffers, each containing one frame (with L2 headers)
>   * @bufs_per_frame:	Number of buffers (iovec entries) per frame
>   * @nframes:		Number of frames to send
> + * @vhost:		Use vhost-kernel or not

Hrm.  I don't love this parameter.  tap_send_frames() is supposed to
more or less abstract away the differences between the different
backends, but here we're putting a tuntap (pasta mode) specific option
here.  I don't quickly see a way to avoid it though; making this a
different mode / a sub-mode in struct ctx would take away the option
of using vanilla tap for slow paths.

>   *
>   * @iov must have total length @bufs_per_frame * @nframes, with each set of
>   * @bufs_per_frame contiguous buffers representing a single frame.
>   *
> - * Return: number of frames actually sent
> + * Return: number of frames actually sent (or queued)
>   */
>  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
> -		       size_t bufs_per_frame, size_t nframes)
> +		       size_t bufs_per_frame, size_t nframes, bool vhost)
>  {
>  	size_t m;
>  
> @@ -579,7 +641,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  
>  	switch (c->mode) {
>  	case MODE_PASTA:
> -		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
> +		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes, vhost);
>  		break;
>  	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
> diff --git a/tap.h b/tap.h
> index ff8cee5..e924dfb 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -111,9 +111,9 @@ void tap_udp6_send(const struct ctx *c,
>  void tap_icmp6_send(const struct ctx *c,
>  		    const struct in6_addr *src, const struct in6_addr *dst,
>  		    const void *in, size_t l4len);
> -void tap_send_single(const struct ctx *c, const void *data, size_t l2len);
> +void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
>  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
> -		       size_t bufs_per_frame, size_t nframes);
> +		       size_t bufs_per_frame, size_t nframes, bool vhost);
>  void eth_update_mac(struct ethhdr *eh,
>  		    const unsigned char *eth_d, const unsigned char *eth_s);
>  void tap_listen_handler(struct ctx *c, uint32_t events);
> diff --git a/tcp.c b/tcp.c
> index f43c1e2..05f5b4c 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1935,7 +1935,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
>  
>  	tcp_update_csum(psum, rsth, &payload);
>  	rst_l2len = ((char *)rsth - buf) + sizeof(*rsth);
> -	tap_send_single(c, buf, rst_l2len);
> +	tap_send_single(c, buf, rst_l2len, false);
>  }
>  
>  /**
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 6d79d67..242086d 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -141,7 +141,7 @@ void tcp_payload_flush(const struct ctx *c)
>  	size_t m;
>  
>  	m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
> -			    tcp_payload_used);
> +			    tcp_payload_used, false);
>  	if (m != tcp_payload_used) {
>  		tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
>  			       tcp_payload_used - m);
> diff --git a/udp.c b/udp.c
> index 65a52e0..d017d99 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -809,7 +809,7 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
>  	for (i = 0; i < n; i++)
>  		udp_tap_prepare(udp_mh_recv, i, toside, false);
>  
> -	tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
> +	tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n, false);
>  }
>  
>  /**

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 08/11] tap: add tap_free_old_xmit
  2025-07-09 17:47 ` [RFC v2 08/11] tap: add tap_free_old_xmit Eugenio Pérez
@ 2025-07-24  0:32   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-24  0:32 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 2833 bytes --]

On Wed, Jul 09, 2025 at 07:47:45PM +0200, Eugenio Pérez wrote:
> As pasta cannot modify the sent TCP buffers until vhost-kernel is done
> with them, we need a way to report to the caller which buffers can be
> overwritten.
> 
> Let's start by following the same pattern as in tap write(2): wait until
> pasta can overwrite the buffers.  We can add asynchronous cleanup on top.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  tap.c | 26 ++++++++++++++++++++++++++
>  tap.h |  1 +
>  2 files changed, 27 insertions(+)
> 
> diff --git a/tap.c b/tap.c
> index 7ccac86..55357e3 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -128,6 +128,11 @@ static struct {
>  	/* Number of free descriptors */
>  	uint16_t num_free;
>  
> +	/* Last used_idx in the used ring.
> +	 * Duplicate here allows to check for proper vhost usage, and avoid
> +	 * false sharing between pasta and kernel. */
> +	uint16_t shadow_used_idx;
> +
>  	/* Last used idx processed */
>  	uint16_t last_used_idx;
>  
> @@ -467,6 +472,27 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
>  		eventfd_write(kick_fd, 1);
>  }
>  
> +/* n = target */
> +void tap_free_old_xmit(size_t n)

This function is introduced without either a caller or a descriptive
comment.  That makes it pretty hard to review - I'm not sure what it
is supposed to be doing.
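
From my reading of the loop below, something like this as a header
comment, if I have it right:

	/**
	 * tap_free_old_xmit() - Reclaim tx buffers that vhost has used
	 * @n:		Number of frames to reclaim, spinning until done
	 *
	 * Walk the tx used ring, check that buffers were consumed in
	 * order, and return their descriptors to the free counter.
	 */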

> +{
> +	size_t r = 0;
> +
> +	while (r < n) {
> +		uint16_t used_idx = vqs[1].last_used_idx;
> +		if (vqs[1].shadow_used_idx == used_idx) {
> +		       vqs[1].shadow_used_idx = le16toh(*(volatile uint16_t*)&vring_used_1.used.idx);
> +
> +		       if (vqs[1].shadow_used_idx == used_idx)
> +			       continue;
> +		}
> +
> +		/* assert in-order */
> +		assert(vring_used_1.used.ring[used_idx % VHOST_NDESCS].id == vring_avail_1.avail.ring[used_idx % VHOST_NDESCS]);
> +		vqs[1].num_free += vqs[1].ndescs[used_idx % VHOST_NDESCS];
> +		vqs[1].last_used_idx++;
> +		r++;
> +	}
> +}
>  
>  /**
>   * tap_send_frames_vhost() - Send multiple frames to the pasta tap
> diff --git a/tap.h b/tap.h
> index e924dfb..7ca0fb0 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -112,6 +112,7 @@ void tap_icmp6_send(const struct ctx *c,
>  		    const struct in6_addr *src, const struct in6_addr *dst,
>  		    const void *in, size_t l4len);
>  void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
> +void tap_free_old_xmit(size_t n);
>  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  		       size_t bufs_per_frame, size_t nframes, bool vhost);
>  void eth_update_mac(struct ethhdr *eh,

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 09/11] tcp: start conversion to circular buffer
  2025-07-09 17:47 ` [RFC v2 09/11] tcp: start conversion to circular buffer Eugenio Pérez
@ 2025-07-24  1:03   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-24  1:03 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 9008 bytes --]

On Wed, Jul 09, 2025 at 07:47:46PM +0200, Eugenio Pérez wrote:
> The vhost-kernel module is async by nature: the driver (pasta) places a
> few buffers in the virtqueue and the device (vhost-kernel) trust the

s/trust/trusts/

> driver will not modify them until it uses them.  To implement it is not
> possible with TCP at the moment, as tcp_buf trust it can reuse the
> buffers as soon as tcp_payload_flush() finishes.

> 
> To achieve async let's make tcp_buf work with a circular ring, so vhost
> can transmit at the same time pasta is queing more data.  When a buffer
> is received from a TCP socket, the element is placed in the ring and
> sock_head is moved:
>                              [][][][]
>                              ^ ^
>                              | |
>                              | sock_head
>                              |
>                              tail
>                              tap_head
> 
> When the data is sent to vhost through the tx queue, tap_head is moved
> forward:
>                              [][][][]
>                              ^ ^
>                              | |
>                              | sock_head
>                              | tap_head
>                              |
> 			     tail
> 
> Finally, the tail moves forward when vhost has used the tx buffers, so
> tcp_payload (and all lower protocol buffers) can be reused.
>                              [][][][]
>                                ^
>                                |
>                                sock_head
>                                tap_head
> 			       tail

This all sounds good.  I wonder if it might be clearer to do this
circular queue conversion as a separate patch series.  I think it
makes sense even without the context of vhost (it's closer to how most
network things work).

> In the case of error queueing to the vhost virtqueue, sock_head moves
> backwards.  The only possible error is that the queue is full, as

sock_head moves backwards?  Or tap_head moves backwards?

> virtio-net does not report success on packet sending.
> 
> Starting as simple as possible, and only implementing the count
> variables in this patch so it keeps working as previously.  The circular
> behavior will be added on top.
> 
> From ~16Gbit/s to ~13Gbit/s compared with write(2) to the tap.

I don't really understand what you're comparing here.

> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  tcp_buf.c | 63 +++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 242086d..0437120 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -53,7 +53,12 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
>  
>  /* References tracking the owner connection of frames in the tap outqueue */
>  static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
> -static unsigned int tcp_payload_used;
> +static unsigned int tcp_payload_sock_used, tcp_payload_tap_used;

I think the "payload" here is a hangover from when we had separate
queues for flags-only and data-containing packets.  We can probably
drop it and make a bunch of names shorter.

> +static void tcp_payload_sock_produce(size_t n)
> +{
> +	tcp_payload_sock_used += n;
> +}
>  
>  static struct iovec	tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
>  
> @@ -132,6 +137,16 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
>  	}
>  }
>  
> +static void tcp_buf_free_old_tap_xmit(void)
> +{
> +	while (tcp_payload_tap_used) {
> +		tap_free_old_xmit(tcp_payload_tap_used);
> +
> +		tcp_payload_tap_used = 0;
> +		tcp_payload_sock_used = 0;
> +	}
> +}
> +
>  /**
>   * tcp_payload_flush() - Send out buffers for segments with data or flags
>   * @c:		Execution context
> @@ -141,12 +156,13 @@ void tcp_payload_flush(const struct ctx *c)
>  	size_t m;
>  
>  	m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
> -			    tcp_payload_used, false);
> -	if (m != tcp_payload_used) {
> +			    tcp_payload_sock_used, true);
> +	if (m != tcp_payload_sock_used) {
>  		tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
> -			       tcp_payload_used - m);
> +			       tcp_payload_sock_used - m);
>  	}
> -	tcp_payload_used = 0;
> +	tcp_payload_tap_used += m;
> +	tcp_buf_free_old_tap_xmit();
>  }
>  
>  /**
> @@ -195,12 +211,12 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	uint32_t seq;
>  	int ret;
>  
> -	iov = tcp_l2_iov[tcp_payload_used];
> +	iov = tcp_l2_iov[tcp_payload_sock_used];
>  	if (CONN_V4(conn)) {
> -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
> +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
>  		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
>  	} else {
> -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
> +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
>  		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
>  	}
>  
> @@ -211,13 +227,14 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	if (ret <= 0)
>  		return ret;
>  
> -	tcp_payload_used++;
> +	tcp_payload_sock_produce(1);
>  	l4len = optlen + sizeof(struct tcphdr);
>  	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
>  	tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
>  
>  	if (flags & DUP_ACK) {
> -		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used++];
> +		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_sock_used];
> +		tcp_payload_sock_produce(1);
>  
>  		memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
>  		       iov[TCP_IOV_TAP].iov_len);
> @@ -228,8 +245,9 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  		dup_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
>  	}
>  
> -	if (tcp_payload_used > TCP_FRAMES_MEM - 2)
> +	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 2) {
>  		tcp_payload_flush(c);
> +	}

No  { } here in passt style.

>  
>  	return 0;
>  }
> @@ -251,19 +269,19 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  	struct iovec *iov;
>  
>  	conn->seq_to_tap = seq + dlen;
> -	tcp_frame_conns[tcp_payload_used] = conn;
> -	iov = tcp_l2_iov[tcp_payload_used];
> +	tcp_frame_conns[tcp_payload_sock_used] = conn;
> +	iov = tcp_l2_iov[tcp_payload_sock_used];
>  	if (CONN_V4(conn)) {
>  		if (no_csum) {
> -			struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
> +			struct iovec *iov_prev = tcp_l2_iov[tcp_payload_sock_used - 1];
>  			struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
>  
>  			check = &iph->check;
>  		}
> -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
> +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
>  		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
>  	} else if (CONN_V6(conn)) {
> -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
> +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
>  		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
>  	}
>  	payload = iov[TCP_IOV_PAYLOAD].iov_base;
> @@ -274,8 +292,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  	payload->th.psh = push;
>  	iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
>  	tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
> -	if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
> +	tcp_payload_sock_produce(1);
> +	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 1) {
>  		tcp_payload_flush(c);
> +	}
>  }
>  
>  /**
> @@ -341,15 +361,12 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  		mh_sock.msg_iovlen = fill_bufs;
>  	}
>  
> -	if (tcp_payload_used + fill_bufs > TCP_FRAMES_MEM) {
> +	if (tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM) {
>  		tcp_payload_flush(c);
> -
> -		/* Silence Coverity CWE-125 false positive */
> -		tcp_payload_used = 0;
>  	}
>  
>  	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
> -		iov->iov_base = &tcp_payload[tcp_payload_used + i].data;
> +		iov->iov_base = &tcp_payload[tcp_payload_sock_used + i].data;
>  		iov->iov_len = mss;
>  	}
>  	if (iov_rem)
> @@ -407,7 +424,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  	dlen = mss;
>  	seq = conn->seq_to_tap;
>  	for (i = 0; i < send_bufs; i++) {
> -		int no_csum = i && i != send_bufs - 1 && tcp_payload_used;
> +		int no_csum = i && i != send_bufs - 1 && tcp_payload_sock_used;
>  		bool push = false;
>  
>  		if (i == send_bufs - 1) {

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 10/11] tap: add poll(2) to used_idx
  2025-07-09 17:47 ` [RFC v2 10/11] tap: add poll(2) to used_idx Eugenio Pérez
@ 2025-07-24  1:20   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-24  1:20 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 5917 bytes --]

On Wed, Jul 09, 2025 at 07:47:47PM +0200, Eugenio Pérez wrote:
> From ~13Gbit/s to ~11.5Gbit/s.

Again, I really don't know what you're comparing to what here.

> 
> TODO: Maybe we can reuse epoll for this, not needing to introduce a new
> syscall.

I really hope so.
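
Hopefully it's just a matter of handling the tx call_fd from the main
epoll loop like the other eventfds, e.g. (sketch; the enum value and
the non-blocking reclaim variant are names I'm making up here):

	case EPOLL_TYPE_VHOST_CALL_TX:
		/* Reclaim whatever vhost already marked used; never block */
		tap_free_old_xmit_done(&c);
		break;

with the blocking read()/poll() in used_idx() dropped.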

> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  tap.c     | 59 +++++++++++++++++++++++++++++++++++++++++++++----------
>  tap.h     |  2 +-
>  tcp_buf.c |  6 +++---
>  3 files changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/tap.c b/tap.c
> index 55357e3..93a8c12 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -19,6 +19,7 @@
>  #include <stdio.h>
>  #include <errno.h>
>  #include <limits.h>
> +#include <poll.h>
>  #include <string.h>
>  #include <net/ethernet.h>
>  #include <net/if.h>
> @@ -120,7 +121,7 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
>  #define VHOST_NDESCS (PKT_BUF_BYTES / 65520)
>  static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
>  			 "Number of vhost descs must be a power of two by standard");
> -static struct {
> +static struct vhost_virtqueue {
>  	/* Descriptor index we're using. This is not the same as avail idx in
>  	 * split: this takes into account the chained descs */
>  	uint16_t vring_idx;
> @@ -472,26 +473,63 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
>  		eventfd_write(kick_fd, 1);
>  }
> 
> +/*
> + * #syscalls:pasta read poll
> + */
> +static uint16_t used_idx(struct vhost_virtqueue *vq,
> +			 struct vring_avail *avail,
> +			 const struct vring_used *used, int pollfd)
> +{
> +	struct pollfd fds = { .fd = pollfd, .events = POLLIN };
> +	int r;
> +
> +	if (vq->shadow_used_idx == vq->last_used_idx)
> +		vq->shadow_used_idx = le16toh(used->idx);
> +
> +	if (vq->shadow_used_idx != vq->last_used_idx || pollfd < 0)
> +		return vq->shadow_used_idx;
> +
> +	avail->flags &= ~htole16(1ULL << VRING_AVAIL_F_NO_INTERRUPT);
> +	/* trusting syscall for smp_wb() */
> +	r = read(pollfd, (uint64_t[]){0}, sizeof(uint64_t));
> +	assert((r < 0 && errno == EAGAIN) || r == 8);
> +
> +	/* Another oportunity before syscalls */
> +	vq->shadow_used_idx = le16toh(used->idx);
> +	if (vq->shadow_used_idx != vq->last_used_idx) {
> +		return vqs->shadow_used_idx;
> +	}
> +
> +	r = poll(&fds, 1, -1);
> +	assert (0 < r);
> +	avail->flags |= htole16(1ULL << VRING_AVAIL_F_NO_INTERRUPT);
> +	vq->shadow_used_idx = le16toh(used->idx);
> +	return vq->shadow_used_idx;

I don't understand what this is accomplishing.  It seems to be
blocking, waiting for a buffer to be freed, which is exactly what we
don't want to do.

> +}
> +
>  /* n = target */
> -void tap_free_old_xmit(size_t n)
> +size_t tap_free_old_xmit(const struct ctx *c, size_t n)
>  {
>  	size_t r = 0;
> +	int pollfd = (n == (size_t)-1) ? -1 : c->vq[1].call_fd;
> 
>  	while (r < n) {
> -		uint16_t used_idx = vqs[1].last_used_idx;
> -		if (vqs[1].shadow_used_idx == used_idx) {
> -		       vqs[1].shadow_used_idx = le16toh(*(volatile uint16_t*)&vring_used_1.used.idx);
> -
> -		       if (vqs[1].shadow_used_idx == used_idx)
> -			       continue;
> +		uint16_t last_used = vqs[1].last_used_idx;
> +		if (used_idx(&vqs[1], &vring_avail_1.avail, &vring_used_1.used, pollfd) == last_used) {
> +			assert(pollfd == -1);
> +			return r;
>  		}
> 
> +		/* Only get used array entries after they have been exposed by vhost. */
> +		smp_rmb();
>  		/* assert in-order */
> -		assert(vring_used_1.used.ring[used_idx % VHOST_NDESCS].id == vring_avail_1.avail.ring[used_idx % VHOST_NDESCS]);
> -		vqs[1].num_free += vqs[1].ndescs[used_idx % VHOST_NDESCS];
> +		assert(vring_used_1.used.ring[last_used % VHOST_NDESCS].id == vring_avail_1.avail.ring[last_used % VHOST_NDESCS]);
> +		vqs[1].num_free += vqs[1].ndescs[last_used % VHOST_NDESCS];
>  		vqs[1].last_used_idx++;
>  		r++;
>  	}
> +
> +	return r;
>  }
> 
>  /**
> @@ -1687,6 +1725,7 @@ static int tap_ns_tun(void *arg)
>  			if (rc < 0)
>  				die_perror("Failed to add call eventfd to epoll");
>  		}
> +		fprintf(stderr, "[eperezma %s:%d][i=%d][call_fd=%d]\n", __func__, __LINE__, i, file.fd);
>  		c->vq[i].call_fd = file.fd;
> 
>  		file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> diff --git a/tap.h b/tap.h
> index 7ca0fb0..7004116 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -112,7 +112,7 @@ void tap_icmp6_send(const struct ctx *c,
>  		    const struct in6_addr *src, const struct in6_addr *dst,
>  		    const void *in, size_t l4len);
>  void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
> -void tap_free_old_xmit(size_t n);
> +size_t tap_free_old_xmit(const struct ctx *c, size_t n);
>  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  		       size_t bufs_per_frame, size_t nframes, bool vhost);
>  void eth_update_mac(struct ethhdr *eh,
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 0437120..f74d22d 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -137,10 +137,10 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
>  	}
>  }
> 
> -static void tcp_buf_free_old_tap_xmit(void)
> +static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
>  {
>  	while (tcp_payload_tap_used) {
> -		tap_free_old_xmit(tcp_payload_tap_used);
> +		tap_free_old_xmit(c, tcp_payload_tap_used);
> 
>  		tcp_payload_tap_used = 0;
>  		tcp_payload_sock_used = 0;
> @@ -162,7 +162,7 @@ void tcp_payload_flush(const struct ctx *c)
>  			       tcp_payload_sock_used - m);
>  	}
>  	tcp_payload_tap_used += m;
> -	tcp_buf_free_old_tap_xmit();
> +	tcp_buf_free_old_tap_xmit(c);
>  }
> 
>  /**

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer
  2025-07-09 17:47 ` [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer Eugenio Pérez
@ 2025-07-24  1:33   ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-24  1:33 UTC (permalink / raw)
  To: Eugenio Pérez; +Cc: passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 10149 bytes --]

On Wed, Jul 09, 2025 at 07:47:48PM +0200, Eugenio Pérez wrote:
> Now both tcp_sock and tap use the circular buffer as intended.
> 
> Very lightly tested.  Especially untested are paths like ring full or
> almost full that are checked before producing, like
> tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM.
> 
> Processing the tx buffers in a circular buffer makes namespace rx go
> from ~11.5Gbit/s to ~17.26Gbit/s.
> 
> TODO: Increase the tx queue length, as we spend a lot of descriptors in
> each request. Ideally, tx size should be at least
> bufs_per_frame*TCP_FRAMES_MEM, but maybe we got more performance with
> bigger queues.
> 
> TODO: Sometimes we call tcp_buf_free_old_tap_xmit twice: one to free at
> least N used tx buffers and the next one in tcp_payload_flush. Maybe we
> can optimize it.
> 
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  tcp_buf.c | 130 ++++++++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 106 insertions(+), 24 deletions(-)
> 
> diff --git a/tcp_buf.c b/tcp_buf.c
> index f74d22d..326af79 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -53,13 +53,66 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
>  
>  /* References tracking the owner connection of frames in the tap outqueue */
>  static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
> -static unsigned int tcp_payload_sock_used, tcp_payload_tap_used;
> +
> +/*
> + * sock_head: Head of buffers available for writing. tcp_data_to_tap moves it
> + * forward, but errors queueing to vhost can move it backwards to tap_head
> + * again.
> + *
> + * tap_head: Head of buffers that have been sent to vhost. flush moves this
> + * forward.
> + *
> + * tail: Chasing index. Increments when vhost uses buffers.
> + *
> + * _used: Independent variables to tell between full and empty.

Hm.  I kind of hope there's a less bulky way of doing this.
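
The usual trick for power-of-two rings (sketch) is to keep free-running
indices and mask them only on array access, so full/empty falls out of
plain subtraction and the separate _used counters go away:

	static unsigned sock_head, tap_head, tail;	/* free-running */

	#define RING_IDX(i)	((i) % TCP_FRAMES_MEM)
	/* in flight: sock_head - tail; full iff that equals
	 * TCP_FRAMES_MEM, empty iff sock_head == tail
	 */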

> + */
> +static unsigned int tcp_payload_sock_head, tcp_payload_tap_head, tcp_payload_tail, tcp_payload_sock_used, tcp_payload_tap_used;
> +#define IS_POW2(y) (((y) > 0) && !((y) & ((y) - 1)))

Worth putting this in util.h as a separate patch.

> +static_assert(ARRAY_SIZE(tcp_payload) == TCP_FRAMES_MEM, "TCP_FRAMES_MEM is not the size of tcp_payload anymore");
> +static_assert(IS_POW2(TCP_FRAMES_MEM), "TCP_FRAMES_MEM must be a power of two");
> +
> +static size_t tcp_payload_cnt_to_end(size_t head, size_t tail)
> +{
> +	assert(head != tail);
> +	size_t end = ARRAY_SIZE(tcp_payload) - tail;
> +	size_t n = (head + end) % ARRAY_SIZE(tcp_payload);
> +
> +	return MIN(n, end);
> +}
> +
> +/* Count the number of items that has been written from sock to the
> + * curcular buffer and can be sent to tap.

s/curcular/circular/g

> + */
> +static size_t tcp_payload_tap_cnt(void)
> +{
> +	return tcp_payload_sock_used - tcp_payload_tap_used;
> +}
>  
>  static void tcp_payload_sock_produce(size_t n)
>  {
> +	tcp_payload_sock_head = (tcp_payload_sock_head + n) % ARRAY_SIZE(tcp_payload);
>  	tcp_payload_sock_used += n;
>  }
>  
> +/* Count the number of consecutive items that has been written from sock to the
> + * curcular buffer and can be sent to tap without having to wrap back to the
> + * beginning of the buffer.
> + */
> +static size_t tcp_payload_tap_cnt_to_end(void)
> +{
> +	if (tcp_payload_sock_head == tcp_payload_tap_head) {
> +		/* empty? */
> +		if (tcp_payload_sock_used - tcp_payload_tap_used == 0)
> +			return 0;
> +
> +		/* full */
> +		return ARRAY_SIZE(tcp_payload) - tcp_payload_tap_head;
> +	}
> +
> +	return tcp_payload_cnt_to_end(tcp_payload_sock_head,
> +			              tcp_payload_tap_head);
> +}
> +
>  static struct iovec	tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
>  
>  /**
> @@ -137,14 +190,13 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
>  	}
>  }
>  
> -static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
> +static void tcp_buf_free_old_tap_xmit(const struct ctx *c, size_t target)
>  {
> -	while (tcp_payload_tap_used) {
> -		tap_free_old_xmit(c, tcp_payload_tap_used);
> +       size_t n = tap_free_old_xmit(c, target);
>  
> -		tcp_payload_tap_used = 0;
> -		tcp_payload_sock_used = 0;
> -	}
> +       tcp_payload_tail = (tcp_payload_tail + n) & (ARRAY_SIZE(tcp_payload) - 1);

use % instead of & here - it's consistent with other places, and the
compiler should be able to optimize it to the same thing.

> +       tcp_payload_tap_used -= n;
> +       tcp_payload_sock_used -= n;
>  }
>  
>  /**
> @@ -153,16 +205,33 @@ static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
>   */
>  void tcp_payload_flush(const struct ctx *c)
>  {
> -	size_t m;
> +	size_t m, n = tcp_payload_tap_cnt_to_end();
> +	struct iovec *head = &tcp_l2_iov[tcp_payload_tap_head][0];
>  
> -	m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
> -			    tcp_payload_sock_used, true);
> -	if (m != tcp_payload_sock_used) {
> -		tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
> -			       tcp_payload_sock_used - m);
> -	}
> +	tcp_buf_free_old_tap_xmit(c, (size_t)-1);
> +	m = tap_send_frames(c, head, TCP_NUM_IOVS, n, true);
>  	tcp_payload_tap_used += m;
> -	tcp_buf_free_old_tap_xmit(c);
> +	tcp_payload_tap_head = (tcp_payload_tap_head + m) %
> +		               ARRAY_SIZE(tcp_payload);
> +
> +	if (m != n) {
> +		n = tcp_payload_tap_cnt_to_end();
> +
> +		tcp_revert_seq(c, &tcp_frame_conns[tcp_payload_tap_head],
> +			       &tcp_l2_iov[tcp_payload_tap_head], n);
> +		/*
> +		 * circular buffer wrap case.
> +		 * TODO: Maybe it's better to adapt tcp_revert_seq.
> +		 */
> +		tcp_revert_seq(c, &tcp_frame_conns[0], &tcp_l2_iov[0],
> +			       tcp_payload_tap_cnt() - n);
> +
> +		tcp_payload_sock_head = tcp_payload_tap_head;
> +		tcp_payload_sock_used = tcp_payload_tap_used;
> +	} else if (tcp_payload_tap_cnt_to_end()) {
> +		/* circular buffer wrap case */
> +		tcp_payload_flush(c);
> +	}
>  }
>  
>  /**
> @@ -209,14 +278,15 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	size_t optlen;
>  	size_t l4len;
>  	uint32_t seq;
> +	unsigned int i = tcp_payload_sock_head;
>  	int ret;
>  
> -	iov = tcp_l2_iov[tcp_payload_sock_used];
> +	iov = tcp_l2_iov[i];
>  	if (CONN_V4(conn)) {
> -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
> +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[i]);
>  		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
>  	} else {
> -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
> +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[i]);
>  		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
>  	}
>  
> @@ -228,13 +298,15 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  		return ret;
>  
>  	tcp_payload_sock_produce(1);
> +	i = tcp_payload_sock_head;
>  	l4len = optlen + sizeof(struct tcphdr);
>  	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
>  	tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
>  
>  	if (flags & DUP_ACK) {
> -		struct iovec *dup_iov = tcp_l2_iov[tcp_payload_sock_used];
> +		struct iovec *dup_iov = tcp_l2_iov[i];
>  		tcp_payload_sock_produce(1);
> +		i = tcp_payload_sock_head;
>  
>  		memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
>  		       iov[TCP_IOV_TAP].iov_len);
> @@ -246,7 +318,10 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	}
>  
>  	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 2) {
> +		tcp_buf_free_old_tap_xmit(c, 2);
>  		tcp_payload_flush(c);
> +		/* TODO how to fix this? The original code didn't check for success either */
> +		assert(tcp_payload_sock_used <= TCP_FRAMES_MEM - 2);
>  	}
>  
>  	return 0;
> @@ -269,16 +344,17 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  	struct iovec *iov;
>  
>  	conn->seq_to_tap = seq + dlen;
> -	tcp_frame_conns[tcp_payload_sock_used] = conn;
> -	iov = tcp_l2_iov[tcp_payload_sock_used];
> +	tcp_frame_conns[tcp_payload_sock_head] = conn;
> +	iov = tcp_l2_iov[tcp_payload_sock_head];
>  	if (CONN_V4(conn)) {
>  		if (no_csum) {
> -			struct iovec *iov_prev = tcp_l2_iov[tcp_payload_sock_used - 1];
> +			unsigned prev = (tcp_payload_sock_head - 1) % TCP_FRAMES_MEM;
> +			struct iovec *iov_prev = tcp_l2_iov[prev];
>  			struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
>  
>  			check = &iph->check;
>  		}
> -		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
> +		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_head]);
>  		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
>  	} else if (CONN_V6(conn)) {
>  		iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
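
(Separate nit: this CONN_V6() branch still indexes tcp6_payload_ip with
tcp_payload_sock_used, while the CONN_V4() path above was switched to
tcp_payload_sock_head -- presumably it wants the same conversion, unless
a later patch in the series takes care of it.)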
> @@ -294,8 +370,11 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  	tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
>  	tcp_payload_sock_produce(1);
>  	if (tcp_payload_sock_used > TCP_FRAMES_MEM - 1) {
> +		tcp_buf_free_old_tap_xmit(c, 1);
>  		tcp_payload_flush(c);
> +		assert(tcp_payload_sock_used <= TCP_FRAMES_MEM - 1);
>  	}
> +
>  }
>  
>  /**
> @@ -362,11 +441,14 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  	}
>  
>  	if (tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM) {
> +		tcp_buf_free_old_tap_xmit(c, fill_bufs);
>  		tcp_payload_flush(c);
> +		/* TODO how to report this to upper layers? */
> +		assert(tcp_payload_sock_used + fill_bufs <= TCP_FRAMES_MEM);
>  	}
>  
>  	for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
> -		iov->iov_base = &tcp_payload[tcp_payload_sock_used + i].data;
> +		iov->iov_base = &tcp_payload[(tcp_payload_sock_head + i) % TCP_FRAMES_MEM].data;
>  		iov->iov_len = mss;
>  	}
>  	if (iov_rem)

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 07/11] tap: support tx through vhost
  2025-07-24  0:24   ` David Gibson
@ 2025-07-24 14:30     ` Stefano Brivio
  2025-07-25  0:23       ` David Gibson
  0 siblings, 1 reply; 26+ messages in thread
From: Stefano Brivio @ 2025-07-24 14:30 UTC (permalink / raw)
  To: David Gibson; +Cc: Eugenio Pérez, passt-dev, jasowang

Silly comments only:

On Thu, 24 Jul 2025 10:24:39 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jul 09, 2025 at 07:47:44PM +0200, Eugenio Pérez wrote:
> > No users enable vhost right now; this patch just defines the functions.
> > 
> > The use of the virtqueue is similar to the rx case: pasta fills the
> > descriptor table with the packet data it wants to send to the
> > namespace.  Each descriptor points to a buffer in memory, with an
> > address and a length.  The number of descriptors is again defined by
> > VHOST_NDESCS.  
> 
> Does the number of descriptors have to be equal for the Tx and Rx
> queues?  For Rx, the number of descriptors is basically determined by
> how many frames we can fit in pkt_buf.  That doesn't really apply to
> the Tx path though, it would be more natural to define the number of
> Tx descriptors independently.
> 
> > Afterwards it writes the descriptor index into the avail->ring[] array,
> > then increments avail->idx to make it visible to the kernel, then kicks
> > the virtqueue 1 event fd.
> > 
> > When the kernel does not need the buffer anymore it writes its id into
> > the used_ring->ring[], and increments used_ring->idx.  Normally, the
> > kernel also notifies pasta through the call eventfd of virtqueue 1.
> > But we don't monitor that eventfd.  Instead, we check whether we can
> > reuse the buffers only when we produce, making the code simpler and more
> > performant.  
> 
> Oh, that's pretty neat.
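
As a reader's note, that sequence maps to roughly the following sketch
(struct vring and friends come from <linux/virtio_ring.h>; the function
names, the kick_fd parameter and the release/acquire atomics are
illustrative, not taken from the patch):

	#include <linux/virtio_ring.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Publish one tx buffer and kick vhost-net (single producer). */
	static void vq_tx_produce(struct vring *vr, int kick_fd,
				  uint16_t desc_idx, void *buf, uint32_t len)
	{
		uint64_t one = 1;

		/* Descriptor: address and length of the frame to send */
		vr->desc[desc_idx].addr = (uintptr_t)buf;
		vr->desc[desc_idx].len = len;
		vr->desc[desc_idx].flags = 0;	/* kernel only reads it */

		/* Put the descriptor index in the available ring... */
		vr->avail->ring[vr->avail->idx % vr->num] = desc_idx;

		/* ...and only then bump avail->idx, so the kernel can't
		 * see the new index before the ring entry is in place
		 */
		__atomic_store_n(&vr->avail->idx, vr->avail->idx + 1,
				 __ATOMIC_RELEASE);

		/* Kick the virtqueue 1 eventfd so the kernel processes it */
		if (write(kick_fd, &one, sizeof(one)) < 0)
			perror("vhost kick");
	}

	/* Reclaim at produce time instead of monitoring the call
	 * eventfd: buffers the kernel is done with show up as
	 * used->idx advancing past what we saw last.
	 */
	static uint16_t vq_tx_completed(const struct vring *vr,
					uint16_t last_used)
	{
		uint16_t idx = __atomic_load_n(&vr->used->idx,
					       __ATOMIC_ACQUIRE);

		return (uint16_t)(idx - last_used);	/* free-running, wraps */
	}

The neat part is all in the second helper: no eventfd wakeups on the
completion side, just one load of used->idx whenever we produce.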
> 
> Nit: s/performant/performent/

Actually, "performent" (which I had never heard before) seems to be an
archaic term for "performance":

  https://www.oed.com/dictionary/performent_n

whereas "performant":

  https://en.wiktionary.org/wiki/performant

seems to be fitting here (as much as I'd favour "efficient" instead,
because it actually sounds like a real word ;)).

-- 
Stefano


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC v2 07/11] tap: support tx through vhost
  2025-07-24 14:30     ` Stefano Brivio
@ 2025-07-25  0:23       ` David Gibson
  0 siblings, 0 replies; 26+ messages in thread
From: David Gibson @ 2025-07-25  0:23 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Eugenio Pérez, passt-dev, jasowang

[-- Attachment #1: Type: text/plain, Size: 2332 bytes --]

On Thu, Jul 24, 2025 at 04:30:50PM +0200, Stefano Brivio wrote:
> Silly comments only:
> 
> On Thu, 24 Jul 2025 10:24:39 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jul 09, 2025 at 07:47:44PM +0200, Eugenio Pérez wrote:
> > > No users enable vhost right now; this patch just defines the functions.
> > > 
> > > The use of the virtqueue is similar to the rx case: pasta fills the
> > > descriptor table with the packet data it wants to send to the
> > > namespace.  Each descriptor points to a buffer in memory, with an
> > > address and a length.  The number of descriptors is again defined by
> > > VHOST_NDESCS.  
> > 
> > Does the number of descriptors have to be equal for the Tx and Rx
> > queues?  For Rx, the number of descriptors is basically determined by
> > how many frames we can fit in pkt_buf.  That doesn't really apply to
> > the Tx path though, it would be more natural to define the number of
> > Tx descriptors independently.
> > 
> > > Afterwards it writes the descriptor index into the avail->ring[] array,
> > > then increments avail->idx to make it visible to the kernel, then kicks
> > > the virtqueue 1 event fd.
> > > 
> > > When the kernel does not need the buffer anymore it writes its id into
> > > the used_ring->ring[], and increments used_ring->idx.  Normally, the
> > > kernel also notifies pasta through the call eventfd of virtqueue 1.
> > > But we don't monitor that eventfd.  Instead, we check whether we can
> > > reuse the buffers only when we produce, making the code simpler and more
> > > performant.  
> > 
> > Oh, that's pretty neat.
> > 
> > Nit: s/performant/performent/
> 
> Actually, "performent" (which I had never heard before) seems to be an
> archaic term for "performance":
> 
>   https://www.oed.com/dictionary/performent_n
> 
> whereas "performant":
> 
>   https://en.wiktionary.org/wiki/performant

Oops, you're right.  Embarrassing.  "performant" looked wrong to
me... now they both do.

> seems to be fitting here (as much as I'd favour "efficient" instead,
> because it actually sounds like a real word ;)).

Yeah, agreed.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-07-25  0:23 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
2025-07-23  6:56   ` David Gibson
2025-07-09 17:47 ` [RFC v2 02/11] tap: add die() on vhost error Eugenio Pérez
2025-07-23  6:58   ` David Gibson
2025-07-09 17:47 ` [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf Eugenio Pérez
2025-07-24  0:17   ` David Gibson
2025-07-09 17:47 ` [RFC v2 04/11] tcp: export memory regions to vhost Eugenio Pérez
2025-07-23  7:06   ` David Gibson
2025-07-09 17:47 ` [RFC v2 05/11] virtio: Fill .next in tx queue Eugenio Pérez
2025-07-23  7:07   ` David Gibson
2025-07-09 17:47 ` [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock Eugenio Pérez
2025-07-23  7:09   ` David Gibson
2025-07-09 17:47 ` [RFC v2 07/11] tap: support tx through vhost Eugenio Pérez
2025-07-24  0:24   ` David Gibson
2025-07-24 14:30     ` Stefano Brivio
2025-07-25  0:23       ` David Gibson
2025-07-09 17:47 ` [RFC v2 08/11] tap: add tap_free_old_xmit Eugenio Pérez
2025-07-24  0:32   ` David Gibson
2025-07-09 17:47 ` [RFC v2 09/11] tcp: start conversion to circular buffer Eugenio Pérez
2025-07-24  1:03   ` David Gibson
2025-07-09 17:47 ` [RFC v2 10/11] tap: add poll(2) to used_idx Eugenio Pérez
2025-07-24  1:20   ` David Gibson
2025-07-09 17:47 ` [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer Eugenio Pérez
2025-07-24  1:33   ` David Gibson
2025-07-10  9:46 ` [RFC v2 00/11] Add vhost-net kernel support Eugenio Perez Martin

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).