* [RFC v2 00/11] Add vhost-net kernel support
@ 2025-07-09 17:47 Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
vhost-net is a kernel device that allows reading packets from a tap
device using virtio queues instead of regular read(2) and write(2).
This enables more efficient packet processing, as memory can be
written directly by the kernel to userspace and back instead of
wasting bandwidth on copies, and it enables batching many packets in
a single notification (through eventfds), both tx and rx.
Namespace tx performance improves from ~26.3Gbit/s to ~36.9Gbit/s.
Namespace rx performance improves from ~16Gbit/s to ~17.26Gbit/s.
RFC: At this moment only the following are supported:
* Receiving L2 packets from vhost-kernel into pasta
* Sending L4 TCP socket data through vhost-kernel to the namespace
TODO: Add vhost zerocopy in the tests, and compare with veth.
TODO: Implement at least UDP tx. Or maybe we want UDP to use write(2)
because of latency?
TODO: Check style for variable declarations in for loops and the use of
curly brackets when they wrap more than one line.
TODO: kerneldoc style function header comments
--
v2: Added TCP tx and integrated some comments from the previous series.
Please check each patch message for details.
Eugenio Pérez (11):
tap: implement vhost_call_cb
tap: add die() on vhost error
Replace tx tap hdr with virtio_nethdr_mrg_rxbuf
tcp: export memory regions to vhost
virtio: Fill .next in tx queue
tap: move static iov_sock to tcp_buf_data_from_sock
tap: support tx through vhost
tap: add tap_free_old_xmit
tcp: start conversion to circular buffer
Add poll(2) to used_idx
tcp_buf: adding TCP tx circular buffer
arp.c | 2 +-
epoll_type.h | 4 +
passt.c | 12 +-
passt.h | 11 +-
tap.c | 489 +++++++++++++++++++++++++++++++++++++++++++++++++--
tap.h | 13 +-
tcp.c | 2 +-
tcp_buf.c | 179 +++++++++++++++----
tcp_buf.h | 19 ++
udp.c | 2 +-
10 files changed, 675 insertions(+), 58 deletions(-)
--
2.50.0
* [RFC v2 01/11] tap: implement vhost_call_cb
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 02/11] tap: add die() on vhost error Eugenio Pérez
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
Implement the rx side of vhost-kernel, where the namespace sends
packets through its tap device and the kernel passes them to
pasta using a virtqueue. The virtqueue is built using a
virtual ring (vring), which includes:
* A descriptor table (buffer info)
* An available ring (buffers ready for the kernel)
* A used ring (buffers that the kernel has filled)
The descriptor table holds an array of buffers defined by address and
length. The kernel writes the packets transmitted by the namespace into
these buffers. The number of descriptors (vq size) is set by
VHOST_NDESCS. Pasta fills this table using pkt_buf, splitting it
evenly across all descriptors. This table is read-only for the kernel.
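For illustration, initializing the rx descriptors boils down to the
following (a minimal sketch using the same names as the diff below):

    for (i = 0; i < VHOST_NDESCS; ++i) {
        /* Each descriptor points at an even slice of pkt_buf, and is
         * writable by the kernel (rx) */
        vring_desc[0][i].addr  = (uintptr_t)pkt_buf +
                                 i * (PKT_BUF_BYTES / VHOST_NDESCS);
        vring_desc[0][i].len   = PKT_BUF_BYTES / VHOST_NDESCS;
        vring_desc[0][i].flags = VRING_DESC_F_WRITE;
    }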
The available ring is where pasta marks which buffers the kernel can
use. It's read-only for the kernel. It includes a ring[] array with
the descriptor indexes and an avail->idx index. Pasta increments
avail->idx when a new buffer is added, modulo the size of the
virtqueue. As pasta writes the rx buffers sequentially, ring[] is
always [0, 1, 2...] and only avail->idx is incremented when new buffers
are available for the kernel. avail->idx can be incremented by more
than one at a time.
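In code, publishing n new rx buffers reduces to bumping the
free-running index; this is what rx_pkt_refill() in the diff below
does (sketch):

    /* ring[] is prefilled with 0, 1, 2... so only the free-running
     * index moves; the kernel masks it with the virtqueue size */
    vring_avail_0.avail.idx += vqs[0].num_free;
    vqs[0].num_free = 0;
    vhost_kick(&vring_used_0.used, c->vq[0].kick_fd);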
Pasta also notifies the kernel of new available buffers by writing to
the kick eventfd.
Once the kernel has written a frame into a descriptor, it writes the
descriptor's index into used_ring->ring[] and increments
used_ring->idx accordingly. Like avail->idx, the kernel can increment
it by more than one at a time. Pasta gets a notification on the call
eventfd, so we add it to the epoll context.
Pasta assumes buffers are used in order. QEMU also assumes this in the
virtio-net migration code, so it is safe.
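Consuming one used entry then looks like this sketch (the full
version, with sanity checks, is virtqueue_get_rx_buf() below):

    uint16_t used_idx = le16toh(used->idx); /* free-running, kernel-written */
    smp_rmb();                              /* read idx before ring[] */
    if (used_idx != vqs[0].last_used_idx) {
        uint16_t last = vqs[0].last_used_idx % VHOST_NDESCS;
        uint32_t id  = le32toh(used->ring[last].id);
        uint32_t len = le32toh(used->ring[last].len);
        /* buffer id, i.e. that pkt_buf slice, now holds len bytes */
        vqs[0].last_used_idx++;
    }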
Now, vhost-kernel is designed to read the virtqueues and the buffers
through *guest* physical addresses (GPA), not process virtual
addresses (HVA). QEMU communicates the translations through the memory
regions. Since we don't have GPAs, let's just send the memory regions
as a 1:1 translation of the HVA.
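Each region is therefore an identity mapping; for example, for
pkt_buf (a sketch of what the VHOST_MEMORY_REGION() macro in the diff
expands to, including the -1 size adjustment explained there):

    (struct vhost_memory_region) {
        .guest_phys_addr = (uintptr_t)pkt_buf,  /* GPA == HVA */
        .userspace_addr  = (uintptr_t)pkt_buf,
        .memory_size     = sizeof(pkt_buf) - 1,
    };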
TODO: Evaluate if we can reuse the tap fd code instead of making a new
epoll event type.
TODO: Split a new file for vhost (Stefano)
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
RFC v2:
* Need to integrate "now" parameter in tap_add_packet and replace
TAP_MSGS to TAP_MSGS{4,6}.
* Actually refill rx queue at the end of operation.
* Explain virtqueues and memory regions theory of operation.
* Extract vhost_kick so it can be reused by tx.
* Add macro for VHOST regions.
* Only register call_cb in rx queue, as tx calls are handled just if
needed.
* Renamed vhost_call_cb to tap_vhost_input (David)
* Move the inuse and last_used_idx to a writable struct from vhost tx
function. Context is readonly for vhost tx code path.
* Changed from inuse to num_free tracking, more aligned with the kernel driver.
* Always use the same tap_pool instead of having two split pools for v4
and v6.
* Space between (struct vhost_memory_region) {.XXX = ...}. (Stefano)
* Expand comments about memory region size (David)
* Add some TODOs.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
epoll_type.h | 2 +
passt.c | 7 +-
passt.h | 10 +-
tap.c | 311 ++++++++++++++++++++++++++++++++++++++++++++++++++-
tap.h | 8 ++
5 files changed, 335 insertions(+), 3 deletions(-)
diff --git a/epoll_type.h b/epoll_type.h
index 12ac59b..0371c14 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -44,6 +44,8 @@ enum epoll_type {
EPOLL_TYPE_REPAIR_LISTEN,
/* TCP_REPAIR helper socket */
EPOLL_TYPE_REPAIR,
+ /* vhost-kernel call socket */
+ EPOLL_TYPE_VHOST_CALL,
EPOLL_NUM_TYPES,
};
diff --git a/passt.c b/passt.c
index 388d10f..0f2659c 100644
--- a/passt.c
+++ b/passt.c
@@ -79,6 +79,7 @@ char *epoll_type_str[] = {
[EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket",
[EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket",
[EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket",
+ [EPOLL_TYPE_VHOST_CALL] = "vhost-kernel call socket",
};
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
"epoll_type_str[] doesn't match enum epoll_type");
@@ -310,7 +311,8 @@ loop:
switch (ref.type) {
case EPOLL_TYPE_TAP_PASTA:
- tap_handler_pasta(&c, eventmask, &now);
+ // TODO: Find a better way, maybe reuse the fd.
+ // tap_handler_pasta(&c, eventmask, &now);
break;
case EPOLL_TYPE_TAP_PASST:
tap_handler_passt(&c, eventmask, &now);
@@ -357,6 +359,9 @@ loop:
case EPOLL_TYPE_REPAIR:
repair_handler(&c, eventmask);
break;
+ case EPOLL_TYPE_VHOST_CALL:
+ tap_vhost_input(&c, ref, &now);
+ break;
default:
/* Can't happen */
ASSERT(0);
diff --git a/passt.h b/passt.h
index 8693794..7bb86c4 100644
--- a/passt.h
+++ b/passt.h
@@ -45,7 +45,7 @@ union epoll_ref;
* @icmp: ICMP-specific reference part
* @data: Data handled by protocol handlers
* @nsdir_fd: netns dirfd for fallback timer checking if namespace is gone
- * @queue: vhost-user queue index for this fd
+ * @queue: vhost queue index for this fd
* @u64: Opaque reference for epoll_ctl() and epoll_wait()
*/
union epoll_ref {
@@ -269,11 +269,14 @@ struct ctx {
int fd_tap;
int fd_repair_listen;
int fd_repair;
+ /* TODO document all added fields */
+ int fd_vhost;
unsigned char our_tap_mac[ETH_ALEN];
unsigned char guest_mac[ETH_ALEN];
uint16_t mtu;
uint64_t hash_secret[2];
+ uint64_t virtio_features;
int ifi4;
struct ip4_ctx ip4;
@@ -297,6 +300,11 @@ struct ctx {
int no_icmp;
struct icmp_ctx icmp;
+ struct {
+ int kick_fd;
+ int call_fd;
+ } vq[2];
+
int no_dns;
int no_dns_search;
int no_dhcp_dns;
diff --git a/tap.c b/tap.c
index 6db5d88..e4a3822 100644
--- a/tap.c
+++ b/tap.c
@@ -31,6 +31,7 @@
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
+#include <sys/eventfd.h>
#include <sys/uio.h>
#include <stdbool.h>
#include <stdlib.h>
@@ -101,6 +102,51 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
#define TAP_SEQS 128 /* Different L4 tuples in one batch */
#define FRAGMENT_MSG_RATE 10 /* # seconds between fragment warnings */
+#define VHOST_VIRTIO 0xAF
+#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
+#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
+
+#define VHOST_NDESCS (PKT_BUF_BYTES / 65520)
+static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
+ "Number of vhost descs must be a power of two by standard");
+static struct {
+ /* Number of free descriptors */
+ uint16_t num_free;
+
+ /* Last used idx processed */
+ uint16_t last_used_idx;
+} vqs[2];
+
+static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
+static union {
+ struct vring_avail avail;
+ char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])];
+} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __attribute__((aligned(PAGE_SIZE)));
+static union {
+ struct vring_used used;
+ char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
+} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));
+
+/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */
+#define N_VHOST_REGIONS 6
+union {
+ struct vhost_memory mem;
+ char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])];
+} vhost_memory = {
+ .mem = {
+ .nregions = N_VHOST_REGIONS,
+ },
+};
+
/**
* tap_l2_max_len() - Maximum frame size (including L2 header) for current mode
* @c: Execution context
@@ -399,6 +445,18 @@ void tap_icmp6_send(const struct ctx *c,
tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
}
+static void vhost_kick(struct vring_used *used, int kick_fd) {
+ /* We need to expose available array entries before checking avail
+ * event.
+ *
+ * TODO: Does eventfd_write already do this?
+ */
+ smp_mb();
+
+ if (!(used->flags & VRING_USED_F_NO_NOTIFY))
+ eventfd_write(kick_fd, 1);
+}
+
/**
* tap_send_frames_pasta() - Send multiple frames to the pasta tap
* @c: Execution context
@@ -1386,6 +1444,89 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
tap_start_connection(c);
}
+static void *virtqueue_get_rx_buf(unsigned qid, unsigned *len)
+{
+ struct vring_used *used = !qid ? &vring_used_0.used : &vring_used_1.used;
+ uint32_t i;
+ uint16_t used_idx, last_used;
+
+ /* TODO think if this has races with previous eventfd_read */
+ /* TODO we could improve performance with a shadow_used_idx */
+ used_idx = le16toh(used->idx);
+
+ smp_rmb();
+
+ if (used_idx == vqs[0].last_used_idx) {
+ *len = 0;
+ return NULL;
+ }
+
+ last_used = vqs[0].last_used_idx % VHOST_NDESCS;
+ i = le32toh(used->ring[last_used].id);
+ *len = le32toh(used->ring[last_used].len);
+
+ /* Make sure the kernel is consuming the descriptors in order */
+ if (i != last_used) {
+ die("vhost: id %u at used position %u != %u", i, last_used, i);
+ return NULL;
+ }
+
+ if (*len > PKT_BUF_BYTES/VHOST_NDESCS) {
+ warn("vhost: id %d len %u > %zu", i, *len, PKT_BUF_BYTES/VHOST_NDESCS);
+ return NULL;
+ }
+
+ /* TODO check if the id is valid and it has not been double used */
+ vqs[0].last_used_idx++;
+ vqs[0].num_free++;
+ return pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);
+}
+
+/* TODO this assumes the kernel consumes descriptors in order */
+static void rx_pkt_refill(struct ctx *c)
+{
+ /* TODO: tune this threshold */
+ if (!vqs[0].num_free)
+ return;
+
+ vring_avail_0.avail.idx += vqs[0].num_free;
+ vqs[0].num_free = 0;
+ vhost_kick(&vring_used_0.used, c->vq[0].kick_fd);
+}
+
+void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct timespec *now)
+{
+ eventfd_read(ref.fd, (eventfd_t[]){ 0 });
+
+ tap_flush_pools();
+
+ while (true) {
+ struct virtio_net_hdr_mrg_rxbuf *hdr;
+ unsigned len;
+
+ hdr = virtqueue_get_rx_buf(ref.queue, &len);
+ if (!hdr)
+ break;
+
+ if (len < sizeof(*hdr)) {
+ warn("vhost: invalid len %u", len);
+ continue;
+ }
+
+ /* TODO this will break from this moment */
+ if (hdr->num_buffers != 1) {
+ warn("vhost: Too many buffers %u, %zu bytes should be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS);
+ continue;
+ }
+
+ /* TODO fix the v6 pool to support ipv6 */
+ tap_add_packet(c, len - sizeof(*hdr), (void *)(hdr+1), now);
+ }
+
+ tap_handler(c, now);
+ rx_pkt_refill(c);
+}
+
/**
* tap_ns_tun() - Get tuntap fd in namespace
* @c: Execution context
@@ -1396,10 +1537,14 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
*/
static int tap_ns_tun(void *arg)
{
+ /* TODO we need to check if vhost support VIRTIO_NET_F_MRG_RXBUF and VHOST_NET_F_VIRTIO_NET_HDR actually */
+ static const uint64_t features =
+ (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_NET_F_MRG_RXBUF) | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR);
struct ifreq ifr = { .ifr_flags = IFF_TAP | IFF_NO_PI };
int flags = O_RDWR | O_NONBLOCK | O_CLOEXEC;
struct ctx *c = (struct ctx *)arg;
- int fd, rc;
+ unsigned i;
+ int fd, vhost_fd, rc;
c->fd_tap = -1;
memcpy(ifr.ifr_name, c->pasta_ifn, IFNAMSIZ);
@@ -1409,6 +1554,143 @@ static int tap_ns_tun(void *arg)
if (fd < 0)
die_perror("Failed to open() /dev/net/tun");
+ vhost_fd = open("/dev/vhost-net", flags);
+ if (vhost_fd < 0)
+ die_perror("Failed to open() /dev/vhost-net");
+
+ rc = ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
+ if (rc < 0)
+ die_perror("VHOST_SET_OWNER ioctl on /dev/vhost-net failed");
+
+ rc = ioctl(vhost_fd, VHOST_GET_FEATURES, &c->virtio_features);
+ if (rc < 0)
+ die_perror("VHOST_GET_FEATURES ioctl on /dev/vhost-net failed");
+
+ /* TODO inform more explicitly */
+ fprintf(stderr, "vhost features: %lx\n", c->virtio_features);
+ fprintf(stderr, "req features: %lx\n", features);
+ c->virtio_features &= features;
+ if (c->virtio_features != features)
+ die("vhost does not support required features");
+
+ for (i = 0; i < ARRAY_SIZE(c->vq); i++) {
+ struct vhost_vring_file file = {
+ .index = i,
+ };
+ union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_CALL,
+ .queue = i };
+ struct epoll_event ev;
+
+ file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+ ref.fd = file.fd;
+ if (file.fd < 0)
+ die_perror("Failed to create call eventfd");
+
+ rc = ioctl(vhost_fd, VHOST_SET_VRING_CALL, &file);
+ if (rc < 0)
+ die_perror(
+ "VHOST_SET_VRING_CALL ioctl on /dev/vhost-net failed");
+
+ ev = (struct epoll_event){ .data.u64 = ref.u64, .events = EPOLLIN };
+ if (i == 0) {
+ rc = epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+ if (rc < 0)
+ die_perror("Failed to add call eventfd to epoll");
+ }
+ c->vq[i].call_fd = file.fd;
+ }
+
+ /*
+ * Unlike most C-style APIs, userspace_addr + memory_size is
+ * also accessible by the kernel. Include a -1 to adjust.
+ */
+#define VHOST_MEMORY_REGION_PTR(addr, size) \
+ (struct vhost_memory_region) { \
+ .guest_phys_addr = (uintptr_t)addr, \
+ .memory_size = size - 1, \
+ .userspace_addr = (uintptr_t)addr, \
+ }
+#define VHOST_MEMORY_REGION(elem) VHOST_MEMORY_REGION_PTR(&elem, sizeof(elem))
+
+ /* 1:1 translation */
+ vhost_memory.mem.regions[0] = VHOST_MEMORY_REGION(vring_desc);
+ vhost_memory.mem.regions[1] = VHOST_MEMORY_REGION(vring_avail_0);
+ vhost_memory.mem.regions[2] = VHOST_MEMORY_REGION(vring_avail_1);
+ vhost_memory.mem.regions[3] = VHOST_MEMORY_REGION(vring_used_0);
+ vhost_memory.mem.regions[4] = VHOST_MEMORY_REGION(vring_used_1);
+ vhost_memory.mem.regions[5] = VHOST_MEMORY_REGION(pkt_buf);
+ static_assert(5 < N_VHOST_REGIONS);
+#undef VHOST_MEMORY_REGION
+#undef VHOST_MEMORY_REGION_PTR
+
+ rc = ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &vhost_memory.mem);
+ if (rc < 0)
+ die_perror(
+ "VHOST_SET_MEM_TABLE ioctl on /dev/vhost-net failed");
+
+ /* TODO: probably it increases RX perf */
+#if 0
+ struct ifreq ifr;
+ memset(&ifr, 0, sizeof(ifr));
+
+ if (ioctl(fd, TUNGETIFF, &ifr) != 0)
+ die_perror("Unable to query TUNGETIFF on FD %d", fd);
+ }
+
+ if (ifr.ifr_flags & IFF_VNET_HDR)
+ net->dev.features &= ~(1ULL << VIRTIO_NET_F_MRG_RXBUF);
+#endif
+ rc = ioctl(vhost_fd, VHOST_SET_FEATURES, &c->virtio_features);
+ if (rc < 0)
+ die_perror("VHOST_SET_FEATURES ioctl on /dev/vhost-net failed");
+
+ /* Duplicating foreach queue to follow the exact order from QEMU */
+ for (i = 0; i < ARRAY_SIZE(c->vq); i++) {
+ struct vhost_vring_addr addr = {
+ .index = i,
+ .desc_user_addr = (unsigned long)vring_desc[i],
+ .avail_user_addr = i == 0 ? (unsigned long)&vring_avail_0 :
+ (unsigned long)&vring_avail_1,
+ .used_user_addr = i == 0 ? (unsigned long)&vring_used_0 :
+ (unsigned long)&vring_used_1,
+ /* GPA addr */
+ .log_guest_addr = i == 0 ? (unsigned long)&vring_used_0 :
+ (unsigned long)&vring_used_1,
+ };
+ struct vhost_vring_state state = {
+ .index = i,
+ .num = VHOST_NDESCS,
+ };
+ struct vhost_vring_file file = {
+ .index = i,
+ };
+
+ rc = ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state);
+ if (rc < 0)
+ die_perror(
+ "VHOST_SET_VRING_NUM ioctl on /dev/vhost-net failed");
+
+ fprintf(stderr, "qid: %d\n", i);
+ fprintf(stderr, "vhost desc addr: 0x%llx\n", addr.desc_user_addr);
+ fprintf(stderr, "vhost avail addr: 0x%llx\n", addr.avail_user_addr);
+ fprintf(stderr, "vhost used addr: 0x%llx\n", addr.used_user_addr);
+ rc = ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
+ if (rc < 0)
+ die_perror(
+ "VHOST_SET_VRING_ADDR ioctl on /dev/vhost-net failed");
+
+ file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+ if (file.fd < 0)
+ die_perror("Failed to create kick eventfd");
+ rc = ioctl(vhost_fd, VHOST_SET_VRING_KICK, &file);
+ if (rc < 0)
+ die_perror(
+ "VHOST_SET_VRING_KICK ioctl on /dev/vhost-net failed");
+ c->vq[i].kick_fd = file.fd;
+
+ vqs[i].num_free = VHOST_NDESCS;
+ }
+
rc = ioctl(fd, (int)TUNSETIFF, &ifr);
if (rc < 0)
die_perror("TUNSETIFF ioctl on /dev/net/tun failed");
@@ -1416,7 +1698,34 @@ static int tap_ns_tun(void *arg)
if (!(c->pasta_ifi = if_nametoindex(c->pasta_ifn)))
die("Tap device opened but no network interface found");
+ for (i = 0; i < VHOST_NDESCS; ++i) {
+ vring_desc[0][i].addr = (uintptr_t)pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);
+ vring_desc[0][i].len = PKT_BUF_BYTES/VHOST_NDESCS;
+ vring_desc[0][i].flags = VRING_DESC_F_WRITE;
+ }
+ for (i = 0; i < VHOST_NDESCS; ++i) {
+ vring_avail_0.avail.ring[i] = i;
+ }
+
+ rx_pkt_refill(c);
+
+ /* Duplicating foreach queue to follow the exact order from QEMU */
+ for (int i = 0; i < ARRAY_SIZE(c->vq); i++) {
+ struct vhost_vring_file file = {
+ .index = i,
+ .fd = fd,
+ };
+
+ fprintf(stderr, "qid: %d\n", file.index);
+ fprintf(stderr, "tap fd: %d\n", file.fd);
+ rc = ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &file);
+ if (rc < 0)
+ die_perror(
+ "VHOST_NET_SET_BACKEND ioctl on /dev/vhost-net failed");
+ }
+
c->fd_tap = fd;
+ c->fd_vhost = vhost_fd;
return 0;
}
diff --git a/tap.h b/tap.h
index 6fe3d15..ff8cee5 100644
--- a/tap.h
+++ b/tap.h
@@ -69,6 +69,14 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
thdr->vnet_len = htonl(l2len);
}
+/**
+ * tap_vhost_input() - Handler for new data on the socket to hypervisor vq
+ * @c: Execution context
+ * @ref: epoll reference
+ * @now: Current timestamp
+ */
+void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct timespec *now);
+
unsigned long tap_l2_max_len(const struct ctx *c);
void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
--
2.50.0
* [RFC v2 02/11] tap: add die() on vhost error
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf Eugenio Pérez
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
In case the kernel needs to signal an error.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
RFC v2: Add TODO (David)
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
epoll_type.h | 2 ++
passt.c | 5 +++++
passt.h | 1 +
tap.c | 16 ++++++++++++++++
4 files changed, 24 insertions(+)
diff --git a/epoll_type.h b/epoll_type.h
index 0371c14..93c4701 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -46,6 +46,8 @@ enum epoll_type {
EPOLL_TYPE_REPAIR,
/* vhost-kernel call socket */
EPOLL_TYPE_VHOST_CALL,
+ /* vhost-kernel error socket */
+ EPOLL_TYPE_VHOST_ERROR,
EPOLL_NUM_TYPES,
};
diff --git a/passt.c b/passt.c
index 0f2659c..d839f5a 100644
--- a/passt.c
+++ b/passt.c
@@ -80,6 +80,7 @@ char *epoll_type_str[] = {
[EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket",
[EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket",
[EPOLL_TYPE_VHOST_CALL] = "vhost-kernel call socket",
+ [EPOLL_TYPE_VHOST_ERROR] = "vhost-kernel error socket",
};
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
"epoll_type_str[] doesn't match enum epoll_type");
@@ -362,6 +363,10 @@ loop:
case EPOLL_TYPE_VHOST_CALL:
tap_vhost_input(&c, ref, &now);
break;
+ case EPOLL_TYPE_VHOST_ERROR:
+ /* TODO re-initialize vhost */
+ die("Error on vhost-kernel socket");
+ break;
default:
/* Can't happen */
ASSERT(0);
diff --git a/passt.h b/passt.h
index 7bb86c4..0066145 100644
--- a/passt.h
+++ b/passt.h
@@ -303,6 +303,7 @@ struct ctx {
struct {
int kick_fd;
int call_fd;
+ int err_fd;
} vq[2];
int no_dns;
diff --git a/tap.c b/tap.c
index e4a3822..0c49e6d 100644
--- a/tap.c
+++ b/tap.c
@@ -1598,6 +1598,22 @@ static int tap_ns_tun(void *arg)
die_perror("Failed to add call eventfd to epoll");
}
c->vq[i].call_fd = file.fd;
+
+ file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
+ if (file.fd < 0)
+ die_perror("Failed to create error eventfd");
+
+ rc = ioctl(vhost_fd, VHOST_SET_VRING_ERR, &file);
+ if (rc < 0)
+ die_perror(
+ "VHOST_SET_VRING_ERR ioctl on /dev/vhost-net failed");
+
+ ref.type = EPOLL_TYPE_VHOST_ERROR, ref.fd = file.fd;
+ ev.data.u64 = ref.u64;
+ rc = epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+ if (rc < 0)
+ die_perror("Failed to add error eventfd to epoll");
+ c->vq[i].err_fd = file.fd;
}
/*
--
2.50.0
* [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 02/11] tap: add die() on vhost error Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 04/11] tcp: export memory regions to vhost Eugenio Pérez
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
The vhost kernel device expects this header as the first data of the frame.
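For reference, this is the header that now leads each frame (from
linux/virtio_net.h; num_buffers is only meaningful on rx):

    struct virtio_net_hdr_mrg_rxbuf {
        struct virtio_net_hdr hdr;  /* flags, GSO type, header lengths */
        __virtio16 num_buffers;     /* number of merged rx buffers */
    };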
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
tap.c | 2 +-
tcp_buf.c | 15 +++++++++++++--
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/tap.c b/tap.c
index 0c49e6d..0656294 100644
--- a/tap.c
+++ b/tap.c
@@ -593,7 +593,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
nframes - m, nframes);
pcap_multiple(iov, bufs_per_frame, m,
- c->mode == MODE_PASST ? sizeof(uint32_t) : 0);
+ c->mode == MODE_PASST ? sizeof(uint32_t) : sizeof(struct virtio_net_hdr_mrg_rxbuf));
return m;
}
diff --git a/tcp_buf.c b/tcp_buf.c
index d1fca67..2fbd056 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -22,6 +22,8 @@
#include <netinet/tcp.h>
+#include <linux/virtio_net.h>
+
#include "util.h"
#include "ip.h"
#include "iov.h"
@@ -43,7 +45,7 @@
static struct ethhdr tcp4_eth_src;
static struct ethhdr tcp6_eth_src;
-static struct tap_hdr tcp_payload_tap_hdr[TCP_FRAMES_MEM];
+static struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
/* IP headers for IPv4 and IPv6 */
struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
@@ -75,6 +77,14 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
eth_update_mac(&tcp6_eth_src, eth_d, eth_s);
}
+static inline struct iovec virtio_net_hdr_iov(struct virtio_net_hdr_mrg_rxbuf *hdr)
+{
+ return (struct iovec){
+ .iov_base = hdr,
+ .iov_len = sizeof(*hdr),
+ };
+}
+
/**
* tcp_sock_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets
* @c: Execution context
@@ -85,6 +95,7 @@ void tcp_sock_iov_init(const struct ctx *c)
struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP);
int i;
+ (void)c;
tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6);
tcp4_eth_src.h_proto = htons_constant(ETH_P_IP);
@@ -96,7 +107,7 @@ void tcp_sock_iov_init(const struct ctx *c)
for (i = 0; i < TCP_FRAMES_MEM; i++) {
struct iovec *iov = tcp_l2_iov[i];
- iov[TCP_IOV_TAP] = tap_hdr_iov(c, &tcp_payload_tap_hdr[i]);
+ iov[TCP_IOV_TAP] = virtio_net_hdr_iov(&tcp_payload_tap_hdr[i]);
iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
iov[TCP_IOV_PAYLOAD].iov_base = &tcp_payload[i];
}
--
2.50.0
* [RFC v2 04/11] tcp: export memory regions to vhost
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (2 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 05/11] virtio: Fill .next in tx queue Eugenio Pérez
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
So that vhost-kernel is able to access the TCP buffers.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
tap.c | 14 +++++++++++---
tcp_buf.c | 14 ++++----------
tcp_buf.h | 19 +++++++++++++++++++
3 files changed, 34 insertions(+), 13 deletions(-)
diff --git a/tap.c b/tap.c
index 0656294..8b3ec45 100644
--- a/tap.c
+++ b/tap.c
@@ -63,6 +63,8 @@
#include "vhost_user.h"
#include "vu_common.h"
+#include "tcp_buf.h"
+
/* Maximum allowed frame lengths (including L2 header) */
/* Verify that an L2 frame length limit is large enough to contain the header,
@@ -136,8 +138,8 @@ static union {
char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));
-/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */
-#define N_VHOST_REGIONS 6
+/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf + TCP virtio hdr + TCP eth(src,dst) + TCP ip hdr */
+#define N_VHOST_REGIONS 12
union {
struct vhost_memory mem;
char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])];
@@ -1635,7 +1637,13 @@ static int tap_ns_tun(void *arg)
vhost_memory.mem.regions[3] = VHOST_MEMORY_REGION(vring_used_0);
vhost_memory.mem.regions[4] = VHOST_MEMORY_REGION(vring_used_1);
vhost_memory.mem.regions[5] = VHOST_MEMORY_REGION(pkt_buf);
- static_assert(5 < N_VHOST_REGIONS);
+ vhost_memory.mem.regions[6] = VHOST_MEMORY_REGION(tcp_payload_tap_hdr);
+ vhost_memory.mem.regions[7] = VHOST_MEMORY_REGION(tcp4_eth_src);
+ vhost_memory.mem.regions[8] = VHOST_MEMORY_REGION(tcp6_eth_src);
+ vhost_memory.mem.regions[9] = VHOST_MEMORY_REGION(tcp4_payload_ip);
+ vhost_memory.mem.regions[10] = VHOST_MEMORY_REGION(tcp6_payload_ip);
+ vhost_memory.mem.regions[11] = VHOST_MEMORY_REGION(tcp_payload);
+ static_assert(11 < N_VHOST_REGIONS);
#undef VHOST_MEMORY_REGION
#undef VHOST_MEMORY_REGION_PTR
diff --git a/tcp_buf.c b/tcp_buf.c
index 2fbd056..c999d2e 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -22,8 +22,6 @@
#include <netinet/tcp.h>
-#include <linux/virtio_net.h>
-
#include "util.h"
#include "ip.h"
#include "iov.h"
@@ -35,24 +33,20 @@
#include "tcp_internal.h"
#include "tcp_buf.h"
-#define TCP_FRAMES_MEM 128
-#define TCP_FRAMES \
- (c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
-
/* Static buffers */
/* Ethernet header for IPv4 and IPv6 frames */
-static struct ethhdr tcp4_eth_src;
-static struct ethhdr tcp6_eth_src;
+struct ethhdr tcp4_eth_src;
+struct ethhdr tcp6_eth_src;
-static struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
+struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
/* IP headers for IPv4 and IPv6 */
struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
struct ipv6hdr tcp6_payload_ip[TCP_FRAMES_MEM];
/* TCP segments with payload for IPv4 and IPv6 frames */
-static struct tcp_payload_t tcp_payload[TCP_FRAMES_MEM];
+struct tcp_payload_t tcp_payload[TCP_FRAMES_MEM];
static_assert(MSS4 <= sizeof(tcp_payload[0].data), "MSS4 is greater than 65516");
static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516");
diff --git a/tcp_buf.h b/tcp_buf.h
index 54f5e53..7ae2536 100644
--- a/tcp_buf.h
+++ b/tcp_buf.h
@@ -6,9 +6,28 @@
#ifndef TCP_BUF_H
#define TCP_BUF_H
+#include <linux/virtio_net.h>
+
+#include "tcp_conn.h"
+#include "tcp_internal.h"
+
void tcp_sock_iov_init(const struct ctx *c);
void tcp_payload_flush(const struct ctx *c);
+struct tcp_tap_conn;
int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn);
int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags);
+#define TCP_FRAMES_MEM 128
+#define TCP_FRAMES \
+(c->mode == MODE_PASTA ? 1 : TCP_FRAMES_MEM)
+
+extern struct virtio_net_hdr_mrg_rxbuf tcp_payload_tap_hdr[TCP_FRAMES_MEM];
+extern struct tcp_payload_t tcp_payload[TCP_FRAMES_MEM];
+
+extern struct ethhdr tcp4_eth_src;
+extern struct ethhdr tcp6_eth_src;
+
+extern struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
+extern struct ipv6hdr tcp6_payload_ip[TCP_FRAMES_MEM];
+
#endif /*TCP_BUF_H */
--
2.50.0
* [RFC v2 05/11] virtio: Fill .next in tx queue
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (3 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 04/11] tcp: export memory regions to vhost Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock Eugenio Pérez
` (6 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
This way we can send one frame split across many buffers. The TCP
stack already works this way in pasta.
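A frame spanning two descriptors then looks like this sketch (split
virtqueue format; only .next and the NEXT flag matter here):

    desc[i].flags     = VRING_DESC_F_NEXT;  /* more buffers follow */
    desc[i].next      = i + 1;
    desc[i + 1].flags = 0;                  /* last buffer of the frame */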
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
tap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tap.c b/tap.c
index 8b3ec45..5667fbe 100644
--- a/tap.c
+++ b/tap.c
@@ -1727,6 +1727,9 @@ static int tap_ns_tun(void *arg)
vring_desc[0][i].len = PKT_BUF_BYTES/VHOST_NDESCS;
vring_desc[0][i].flags = VRING_DESC_F_WRITE;
}
+ for (i = 0; i < (VHOST_NDESCS - 1); ++i) {
+ vring_desc[1][i].next = i+1;
+ }
for (i = 0; i < VHOST_NDESCS; ++i) {
vring_avail_0.avail.ring[i] = i;
}
--
2.50.0
* [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (4 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 05/11] virtio: Fill .next in tx queue Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 07/11] tap: support tx through vhost Eugenio Pérez
` (5 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
It is the only function using it, and the name is easy to confuse
with tcp_l2_iov; moving it here avoids that.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
tcp_buf.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/tcp_buf.c b/tcp_buf.c
index c999d2e..6d79d67 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -55,9 +55,6 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
static unsigned int tcp_payload_used;
-/* recvmsg()/sendmsg() data for tap */
-static struct iovec iov_sock [TCP_FRAMES_MEM + 1];
-
static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
/**
@@ -292,6 +289,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
*/
int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
{
+ static struct iovec iov_sock[TCP_FRAMES_MEM + 1];
+
uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
int len, dlen, i, s = conn->sock;
--
2.50.0
* [RFC v2 07/11] tap: support tx through vhost
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (5 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 08/11] tap: add tap_free_old_xmit Eugenio Pérez
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
No users enable vhost right now; this patch just defines the functions.
The use of the virtqueue is similar to the rx case: pasta fills the
descriptor table with the packet data it wants to send to the
namespace. Each
descriptor points to a buffer in memory, with an address and a length.
The number of descriptors is again defined by VHOST_NDESCS.
Afterwards it writes the descriptor index into the avail->ring[] array,
then increments avail->idx to make it visible to the kernel, then kicks
the kick eventfd of virtqueue 1.
When the kernel does not need the buffer anymore it writes its id into
the used_ring->ring[], and increments used_ring->idx. Normally, the
kernel also notifies pasta through the call eventfd of virtqueue 1,
but we don't monitor that eventfd. Instead, we check whether we can
reuse the buffers just when we produce, making the code simpler and
more performant.
Like on the rx path, we assume descriptors are used in the same order
they were made available. This is also consistent with behavior seen in
QEMU's virtio-net implementation.
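The ordering that matters is at publish time, as in this sketch taken
from tap_send_frames_vhost() below: descriptor and ring writes must be
visible before the index update, and the kick only happens if the
kernel did not suppress notifications:

    smp_wmb();  /* expose descriptors and avail ring before idx */
    vring_avail_1.avail.idx = htole16(le16toh(vring_avail_1.avail.idx) +
                                      nframes);
    vhost_kick(&vring_used_1.used, c->vq[1].kick_fd);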
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
arp.c | 2 +-
tap.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++--------
tap.h | 4 +--
tcp.c | 2 +-
tcp_buf.c | 2 +-
udp.c | 2 +-
6 files changed, 79 insertions(+), 17 deletions(-)
diff --git a/arp.c b/arp.c
index fc482bb..ea786a0 100644
--- a/arp.c
+++ b/arp.c
@@ -80,7 +80,7 @@ int arp(const struct ctx *c, const struct pool *p)
memcpy(eh->h_dest, eh->h_source, sizeof(eh->h_dest));
memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
- tap_send_single(c, eh, l2len);
+ tap_send_single(c, eh, l2len, false);
return 1;
}
diff --git a/tap.c b/tap.c
index 5667fbe..7ccac86 100644
--- a/tap.c
+++ b/tap.c
@@ -121,11 +121,19 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
"Number of vhost descs must be a power of two by standard");
static struct {
+ /* Descriptor index we're using. This is not the same as avail idx in
+ * split: this takes into account the chained descs */
+ uint16_t vring_idx;
+
/* Number of free descriptors */
uint16_t num_free;
/* Last used idx processed */
uint16_t last_used_idx;
+
+ /* Descriptors in use */
+ /* desc info: number of descriptors in the chain */
+ uint16_t ndescs[VHOST_NDESCS];
} vqs[2];
static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
@@ -176,7 +184,7 @@ unsigned long tap_l2_max_len(const struct ctx *c)
* @data: Packet buffer
* @l2len: Total L2 packet length
*/
-void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
+void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost)
{
uint32_t vnet_len = htonl(l2len);
struct iovec iov[2];
@@ -192,7 +200,7 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
iov[iovcnt].iov_len = l2len;
iovcnt++;
- tap_send_frames(c, iov, iovcnt, 1);
+ tap_send_frames(c, iov, iovcnt, 1, vhost);
break;
case MODE_VU:
vu_send_single(c, data, l2len);
@@ -314,7 +322,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
memcpy(data, in, dlen);
- tap_send_single(c, buf, dlen + (data - buf));
+ tap_send_single(c, buf, dlen + (data - buf), false);
}
/**
@@ -336,7 +344,7 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
memcpy(icmp4h, in, l4len);
csum_icmp4(icmp4h, icmp4h + 1, l4len - sizeof(*icmp4h));
- tap_send_single(c, buf, l4len + ((char *)icmp4h - buf));
+ tap_send_single(c, buf, l4len + ((char *)icmp4h - buf), false);
}
/**
@@ -421,7 +429,7 @@ void tap_udp6_send(const struct ctx *c,
char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
memcpy(data, in, dlen);
- tap_send_single(c, buf, dlen + (data - buf));
+ tap_send_single(c, buf, dlen + (data - buf), false);
}
/**
@@ -444,7 +452,7 @@ void tap_icmp6_send(const struct ctx *c,
memcpy(icmp6h, in, l4len);
csum_icmp6(icmp6h, src, dst, icmp6h + 1, l4len - sizeof(*icmp6h));
- tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
+ tap_send_single(c, buf, l4len + ((char *)icmp6h - buf), false);
}
static void vhost_kick(struct vring_used *used, int kick_fd) {
@@ -459,8 +467,9 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
eventfd_write(kick_fd, 1);
}
+
/**
- * tap_send_frames_pasta() - Send multiple frames to the pasta tap
+ * tap_send_frames_vhost() - Send multiple frames to the pasta tap
* @c: Execution context
* @iov: Array of buffers
* @bufs_per_frame: Number of buffers (iovec entries) per frame
@@ -470,16 +479,68 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
* @bufs_per_frame contiguous buffers representing a single frame.
*
* Return: number of frames successfully sent
+ */
+static size_t tap_send_frames_vhost(const struct ctx *c,
+ const struct iovec *iov,
+ size_t bufs_per_frame, size_t nframes)
+{
+ size_t i;
+
+ for (i = 0; i < nframes; i++) {
+ size_t j;
+
+ if (vqs[1].num_free < bufs_per_frame)
+ return i;
+
+ vring_avail_1.avail.ring[(vring_avail_1.avail.idx + i) % VHOST_NDESCS] = htole16(vqs[1].vring_idx) % VHOST_NDESCS;
+ vqs[1].ndescs[(vring_avail_1.avail.idx + i) % VHOST_NDESCS] = bufs_per_frame;
+ vqs[1].num_free -= bufs_per_frame;
+
+ for (j = 0; j < bufs_per_frame; ++j) {
+ struct vring_desc *desc = &vring_desc[1][vqs[1].vring_idx % VHOST_NDESCS];
+ const struct iovec *iov_i = &iov[i * bufs_per_frame + j];
+
+ desc->addr = (uint64_t)iov_i->iov_base;
+ desc->len = iov_i->iov_len;
+ desc->flags = (j == bufs_per_frame - 1) ? 0 : htole16(VRING_DESC_F_NEXT);
+ vqs[1].vring_idx++;
+ }
+ }
+
+ smp_wmb();
+ vring_avail_1.avail.idx = htole16(le16toh(vring_avail_1.avail.idx) + nframes);
+
+ vhost_kick(&vring_used_1.used, c->vq[1].kick_fd);
+
+ return nframes;
+}
+
+
+/**
+ * tap_send_frames_pasta() - Send multiple frames to the pasta tap
+ * @c: Execution context
+ * @iov: Array of buffers
+ * @bufs_per_frame: Number of buffers (iovec entries) per frame
+ * @nframes: Number of frames to send
+ * @vhost: Use vhost-kernel or not
+ *
+ * @iov must have total length @bufs_per_frame * @nframes, with each set of
+ * @bufs_per_frame contiguous buffers representing a single frame.
+ *
+ * Return: number of frames successfully sent (or queued)
*
* #syscalls:pasta write
*/
static size_t tap_send_frames_pasta(const struct ctx *c,
const struct iovec *iov,
- size_t bufs_per_frame, size_t nframes)
+ size_t bufs_per_frame, size_t nframes, bool vhost)
{
size_t nbufs = bufs_per_frame * nframes;
size_t i;
+ if (vhost)
+ return tap_send_frames_vhost(c, iov, bufs_per_frame, nframes);
+
for (i = 0; i < nbufs; i += bufs_per_frame) {
ssize_t rc = writev(c->fd_tap, iov + i, bufs_per_frame);
size_t framelen = iov_size(iov + i, bufs_per_frame);
@@ -563,14 +624,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
* @iov: Array of buffers, each containing one frame (with L2 headers)
* @bufs_per_frame: Number of buffers (iovec entries) per frame
* @nframes: Number of frames to send
+ * @vhost: Use vhost-kernel or not
*
* @iov must have total length @bufs_per_frame * @nframes, with each set of
* @bufs_per_frame contiguous buffers representing a single frame.
*
- * Return: number of frames actually sent
+ * Return: number of frames actually sent (or queued)
*/
size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
- size_t bufs_per_frame, size_t nframes)
+ size_t bufs_per_frame, size_t nframes, bool vhost)
{
size_t m;
@@ -579,7 +641,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
switch (c->mode) {
case MODE_PASTA:
- m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
+ m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes, vhost);
break;
case MODE_PASST:
m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
diff --git a/tap.h b/tap.h
index ff8cee5..e924dfb 100644
--- a/tap.h
+++ b/tap.h
@@ -111,9 +111,9 @@ void tap_udp6_send(const struct ctx *c,
void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
const void *in, size_t l4len);
-void tap_send_single(const struct ctx *c, const void *data, size_t l2len);
+void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
- size_t bufs_per_frame, size_t nframes);
+ size_t bufs_per_frame, size_t nframes, bool vhost);
void eth_update_mac(struct ethhdr *eh,
const unsigned char *eth_d, const unsigned char *eth_s);
void tap_listen_handler(struct ctx *c, uint32_t events);
diff --git a/tcp.c b/tcp.c
index f43c1e2..05f5b4c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1935,7 +1935,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
tcp_update_csum(psum, rsth, &payload);
rst_l2len = ((char *)rsth - buf) + sizeof(*rsth);
- tap_send_single(c, buf, rst_l2len);
+ tap_send_single(c, buf, rst_l2len, false);
}
/**
diff --git a/tcp_buf.c b/tcp_buf.c
index 6d79d67..242086d 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -141,7 +141,7 @@ void tcp_payload_flush(const struct ctx *c)
size_t m;
m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
- tcp_payload_used);
+ tcp_payload_used, false);
if (m != tcp_payload_used) {
tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
tcp_payload_used - m);
diff --git a/udp.c b/udp.c
index 65a52e0..d017d99 100644
--- a/udp.c
+++ b/udp.c
@@ -809,7 +809,7 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
for (i = 0; i < n; i++)
udp_tap_prepare(udp_mh_recv, i, toside, false);
- tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
+ tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n, false);
}
/**
--
2.50.0
* [RFC v2 08/11] tap: add tap_free_old_xmit
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (6 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 07/11] tap: support tx through vhost Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 09/11] tcp: start conversion to circular buffer Eugenio Pérez
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
As pasta cannot modify the sent TCP buffers until vhost-kernel is done
with them, we need a way to report to the caller which buffers can be
overwritten.
Let's start by following the same pattern as with tap write(2): wait
until pasta can overwrite the buffers. We can add async cleaning on top.
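The reclaim loop just chases the kernel's used index, as in this
sketch of tap_free_old_xmit() below (the volatile read avoids reusing
a stale cached idx):

    uint16_t used_idx = le16toh(*(volatile uint16_t *)&vring_used_1.used.idx);
    while (vqs[1].last_used_idx != used_idx) {
        vqs[1].num_free += vqs[1].ndescs[vqs[1].last_used_idx % VHOST_NDESCS];
        vqs[1].last_used_idx++;
    }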
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
tap.c | 26 ++++++++++++++++++++++++++
tap.h | 1 +
2 files changed, 27 insertions(+)
diff --git a/tap.c b/tap.c
index 7ccac86..55357e3 100644
--- a/tap.c
+++ b/tap.c
@@ -128,6 +128,11 @@ static struct {
/* Number of free descriptors */
uint16_t num_free;
+ /* Last used_idx in the used ring.
+ * Duplicate here allows to check for proper vhost usage, and avoid
+ * false sharing between pasta and kernel. */
+ uint16_t shadow_used_idx;
+
/* Last used idx processed */
uint16_t last_used_idx;
@@ -467,6 +472,27 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
eventfd_write(kick_fd, 1);
}
+/* @n: target number of frames to reclaim */
+void tap_free_old_xmit(size_t n)
+{
+ size_t r = 0;
+
+ while (r < n) {
+ uint16_t used_idx = vqs[1].last_used_idx;
+ if (vqs[1].shadow_used_idx == used_idx) {
+ vqs[1].shadow_used_idx = le16toh(*(volatile uint16_t*)&vring_used_1.used.idx);
+
+ if (vqs[1].shadow_used_idx == used_idx)
+ continue;
+ }
+
+ /* assert in-order */
+ assert(vring_used_1.used.ring[used_idx % VHOST_NDESCS].id == vring_avail_1.avail.ring[used_idx % VHOST_NDESCS]);
+ vqs[1].num_free += vqs[1].ndescs[used_idx % VHOST_NDESCS];
+ vqs[1].last_used_idx++;
+ r++;
+ }
+}
/**
* tap_send_frames_vhost() - Send multiple frames to the pasta tap
diff --git a/tap.h b/tap.h
index e924dfb..7ca0fb0 100644
--- a/tap.h
+++ b/tap.h
@@ -112,6 +112,7 @@ void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
const void *in, size_t l4len);
void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
+void tap_free_old_xmit(size_t n);
size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
size_t bufs_per_frame, size_t nframes, bool vhost);
void eth_update_mac(struct ethhdr *eh,
--
2.50.0
* [RFC v2 09/11] tcp: start conversion to circular buffer
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (7 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 08/11] tap: add tap_free_old_xmit Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 10/11] tap: add poll(2) to used_idx Eugenio Pérez
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
The vhost-kernel module is async by nature: the driver (pasta) places
a few buffers in the virtqueue and the device (vhost-kernel) trusts
that the driver will not modify them until it has used them. This is
not possible with TCP at the moment, as tcp_buf trusts it can reuse
the buffers as soon as tcp_payload_flush() finishes.
To achieve asynchrony, let's make tcp_buf work with a circular ring, so
vhost can transmit at the same time pasta is queuing more data.
is received from a TCP socket, the element is placed in the ring and
sock_head is moved:
[][][][]
^ ^
| |
| sock_head
|
tail
tap_head
When the data is sent to vhost through the tx queue, tap_head is moved
forward:
[][][][]
^ ^
| |
| sock_head
| tap_head
|
tail
Finally, the tail move forward when vhost has used the tx buffers, so
tcp_payload (and all lower protocol buffers) can be reused.
[][][][]
^
|
sock_head
tap_head
tail
In the case of error queueing to the vhost virtqueue, sock_head moves
backwards. The only possible error is that the queue is full, as
virtio-net does not report success on packet sending.
Starting as simply as possible: this patch only introduces the count
variables, so everything keeps working as before. The circular
behavior will be added on top.
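In this patch the ring is still degenerate: only the two counters
below are kept (sketch), and both are reset once everything queued has
been reclaimed, matching the old single tcp_payload_used counter:

    static unsigned int tcp_payload_sock_used; /* filled from sockets */
    static unsigned int tcp_payload_tap_used;  /* queued to vhost, not
                                                * yet reclaimed */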
From ~16Gbit/s to ~13Gbit/s compared with write(2) to the tap.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
tcp_buf.c | 63 +++++++++++++++++++++++++++++++++++--------------------
1 file changed, 40 insertions(+), 23 deletions(-)
diff --git a/tcp_buf.c b/tcp_buf.c
index 242086d..0437120 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -53,7 +53,12 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
/* References tracking the owner connection of frames in the tap outqueue */
static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
-static unsigned int tcp_payload_used;
+static unsigned int tcp_payload_sock_used, tcp_payload_tap_used;
+
+static void tcp_payload_sock_produce(size_t n)
+{
+ tcp_payload_sock_used += n;
+}
static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
@@ -132,6 +137,16 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
}
}
+static void tcp_buf_free_old_tap_xmit(void)
+{
+ while (tcp_payload_tap_used) {
+ tap_free_old_xmit(tcp_payload_tap_used);
+
+ tcp_payload_tap_used = 0;
+ tcp_payload_sock_used = 0;
+ }
+}
+
/**
* tcp_payload_flush() - Send out buffers for segments with data or flags
* @c: Execution context
@@ -141,12 +156,13 @@ void tcp_payload_flush(const struct ctx *c)
size_t m;
m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
- tcp_payload_used, false);
- if (m != tcp_payload_used) {
+ tcp_payload_sock_used, true);
+ if (m != tcp_payload_sock_used) {
tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
- tcp_payload_used - m);
+ tcp_payload_sock_used - m);
}
- tcp_payload_used = 0;
+ tcp_payload_tap_used += m;
+ tcp_buf_free_old_tap_xmit();
}
/**
@@ -195,12 +211,12 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
uint32_t seq;
int ret;
- iov = tcp_l2_iov[tcp_payload_used];
+ iov = tcp_l2_iov[tcp_payload_sock_used];
if (CONN_V4(conn)) {
- iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
+ iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else {
- iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
+ iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
@@ -211,13 +227,14 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
if (ret <= 0)
return ret;
- tcp_payload_used++;
+ tcp_payload_sock_produce(1);
l4len = optlen + sizeof(struct tcphdr);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
if (flags & DUP_ACK) {
- struct iovec *dup_iov = tcp_l2_iov[tcp_payload_used++];
+ struct iovec *dup_iov = tcp_l2_iov[tcp_payload_sock_used];
+ tcp_payload_sock_produce(1);
memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
iov[TCP_IOV_TAP].iov_len);
@@ -228,8 +245,9 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
dup_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
}
- if (tcp_payload_used > TCP_FRAMES_MEM - 2)
+ if (tcp_payload_sock_used > TCP_FRAMES_MEM - 2) {
tcp_payload_flush(c);
+ }
return 0;
}
@@ -251,19 +269,19 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
struct iovec *iov;
conn->seq_to_tap = seq + dlen;
- tcp_frame_conns[tcp_payload_used] = conn;
- iov = tcp_l2_iov[tcp_payload_used];
+ tcp_frame_conns[tcp_payload_sock_used] = conn;
+ iov = tcp_l2_iov[tcp_payload_sock_used];
if (CONN_V4(conn)) {
if (no_csum) {
- struct iovec *iov_prev = tcp_l2_iov[tcp_payload_used - 1];
+ struct iovec *iov_prev = tcp_l2_iov[tcp_payload_sock_used - 1];
struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
check = &iph->check;
}
- iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]);
+ iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else if (CONN_V6(conn)) {
- iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]);
+ iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
payload = iov[TCP_IOV_PAYLOAD].iov_base;
@@ -274,8 +292,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
payload->th.psh = push;
iov[TCP_IOV_PAYLOAD].iov_len = dlen + sizeof(struct tcphdr);
tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
- if (++tcp_payload_used > TCP_FRAMES_MEM - 1)
+ tcp_payload_sock_produce(1);
+ if (tcp_payload_sock_used > TCP_FRAMES_MEM - 1) {
tcp_payload_flush(c);
+ }
}
/**
@@ -341,15 +361,12 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
mh_sock.msg_iovlen = fill_bufs;
}
- if (tcp_payload_used + fill_bufs > TCP_FRAMES_MEM) {
+ if (tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM) {
tcp_payload_flush(c);
-
- /* Silence Coverity CWE-125 false positive */
- tcp_payload_used = 0;
}
for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
- iov->iov_base = &tcp_payload[tcp_payload_used + i].data;
+ iov->iov_base = &tcp_payload[tcp_payload_sock_used + i].data;
iov->iov_len = mss;
}
if (iov_rem)
@@ -407,7 +424,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
dlen = mss;
seq = conn->seq_to_tap;
for (i = 0; i < send_bufs; i++) {
- int no_csum = i && i != send_bufs - 1 && tcp_payload_used;
+ int no_csum = i && i != send_bufs - 1 && tcp_payload_sock_used;
bool push = false;
if (i == send_bufs - 1) {
--
2.50.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
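A note on the structure this patch introduces: the single tcp_payload_used
counter is split into tcp_payload_sock_used (frames filled from socket data)
and tcp_payload_tap_used (frames already queued to vhost but not yet
reclaimed). Reduced to its essentials, the accounting follows this pattern
(a minimal sketch with stand-in names, not the actual passt code):

	/* Minimal sketch: two-counter accounting between a socket-side
	 * producer and a device-side consumer. NFRAMES and all names here
	 * are illustrative assumptions, not passt identifiers. */
	#include <assert.h>
	#include <stddef.h>

	#define NFRAMES 128

	static unsigned sock_used;	/* frames filled from socket data */
	static unsigned tap_used;	/* frames queued to the device */

	static void produce(size_t n)	/* filled n more frames */
	{
		sock_used += n;
		assert(sock_used <= NFRAMES);
	}

	static void submit(size_t n)	/* queued n frames to the device */
	{
		tap_used += n;
		assert(tap_used <= sock_used);
	}

	static void reclaim(size_t n)	/* device marked n frames used */
	{
		assert(n <= tap_used);
		tap_used -= n;
		sock_used -= n;
	}

Frames counted by sock_used but not by tap_used are filled and still waiting
to be queued; tcp_buf_free_old_tap_xmit() is the reclaim step, and at this
point in the series it still reclaims everything at once.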
* [RFC v2 10/11] tap: add poll(2) to used_idx
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (8 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 09/11] tcp: start conversion to circular buffer Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer Eugenio Pérez
2025-07-10 9:46 ` [RFC v2 00/11] Add vhost-net kernel support Eugenio Perez Martin
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
Namespace rx throughput goes from ~13Gbit/s to ~11.5Gbit/s with this
change.
TODO: Maybe we can reuse epoll for this, avoiding the need to introduce
a new syscall.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
tap.c | 59 +++++++++++++++++++++++++++++++++++++++++++++----------
tap.h | 2 +-
tcp_buf.c | 6 +++---
3 files changed, 53 insertions(+), 14 deletions(-)
diff --git a/tap.c b/tap.c
index 55357e3..93a8c12 100644
--- a/tap.c
+++ b/tap.c
@@ -19,6 +19,7 @@
#include <stdio.h>
#include <errno.h>
#include <limits.h>
+#include <poll.h>
#include <string.h>
#include <net/ethernet.h>
#include <net/if.h>
@@ -120,7 +121,7 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
#define VHOST_NDESCS (PKT_BUF_BYTES / 65520)
static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
"Number of vhost descs must be a power of two by standard");
-static struct {
+static struct vhost_virtqueue {
/* Descriptor index we're using. This is not the same as avail idx in
* split: this takes into account the chained descs */
uint16_t vring_idx;
@@ -472,26 +473,63 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
eventfd_write(kick_fd, 1);
}
+/*
+ * #syscalls:pasta read poll
+ */
+static uint16_t used_idx(struct vhost_virtqueue *vq,
+ struct vring_avail *avail,
+ const struct vring_used *used, int pollfd)
+{
+ struct pollfd fds = { .fd = pollfd, .events = POLLIN };
+ int r;
+
+ if (vq->shadow_used_idx == vq->last_used_idx)
+ vq->shadow_used_idx = le16toh(used->idx);
+
+ if (vq->shadow_used_idx != vq->last_used_idx || pollfd < 0)
+ return vq->shadow_used_idx;
+
+ avail->flags &= ~htole16(VRING_AVAIL_F_NO_INTERRUPT);
+ /* trusting syscall for smp_wb() */
+ r = read(pollfd, (uint64_t[]){0}, sizeof(uint64_t));
+ assert((r < 0 && errno == EAGAIN) || r == 8);
+
+ /* Another opportunity to catch new entries before the blocking syscall */
+ vq->shadow_used_idx = le16toh(used->idx);
+ if (vq->shadow_used_idx != vq->last_used_idx) {
+ return vq->shadow_used_idx;
+ }
+
+ r = poll(&fds, 1, -1);
+ assert(0 < r);
+ avail->flags |= htole16(VRING_AVAIL_F_NO_INTERRUPT);
+ vq->shadow_used_idx = le16toh(used->idx);
+ return vq->shadow_used_idx;
+}
+
/* n = target */
-void tap_free_old_xmit(size_t n)
+size_t tap_free_old_xmit(const struct ctx *c, size_t n)
{
size_t r = 0;
+ int pollfd = (n == (size_t)-1) ? -1 : c->vq[1].call_fd;
while (r < n) {
- uint16_t used_idx = vqs[1].last_used_idx;
- if (vqs[1].shadow_used_idx == used_idx) {
- vqs[1].shadow_used_idx = le16toh(*(volatile uint16_t*)&vring_used_1.used.idx);
-
- if (vqs[1].shadow_used_idx == used_idx)
- continue;
+ uint16_t last_used = vqs[1].last_used_idx;
+ if (used_idx(&vqs[1], &vring_avail_1.avail, &vring_used_1.used, pollfd) == last_used) {
+ assert(pollfd == -1);
+ return r;
}
+ /* Only get used array entries after they have been exposed by vhost. */
+ smp_rmb();
/* assert in-order */
- assert(vring_used_1.used.ring[used_idx % VHOST_NDESCS].id == vring_avail_1.avail.ring[used_idx % VHOST_NDESCS]);
- vqs[1].num_free += vqs[1].ndescs[used_idx % VHOST_NDESCS];
+ assert(vring_used_1.used.ring[last_used % VHOST_NDESCS].id == vring_avail_1.avail.ring[last_used % VHOST_NDESCS]);
+ vqs[1].num_free += vqs[1].ndescs[last_used % VHOST_NDESCS];
vqs[1].last_used_idx++;
r++;
}
+
+ return r;
}
/**
@@ -1687,6 +1725,7 @@ static int tap_ns_tun(void *arg)
if (rc < 0)
die_perror("Failed to add call eventfd to epoll");
}
+ fprintf(stderr, "[eperezma %s:%d][i=%d][call_fd=%d]\n", __func__, __LINE__, i, file.fd);
c->vq[i].call_fd = file.fd;
file.fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
diff --git a/tap.h b/tap.h
index 7ca0fb0..7004116 100644
--- a/tap.h
+++ b/tap.h
@@ -112,7 +112,7 @@ void tap_icmp6_send(const struct ctx *c,
const struct in6_addr *src, const struct in6_addr *dst,
const void *in, size_t l4len);
void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
-void tap_free_old_xmit(size_t n);
+size_t tap_free_old_xmit(const struct ctx *c, size_t n);
size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
size_t bufs_per_frame, size_t nframes, bool vhost);
void eth_update_mac(struct ethhdr *eh,
diff --git a/tcp_buf.c b/tcp_buf.c
index 0437120..f74d22d 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -137,10 +137,10 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
}
}
-static void tcp_buf_free_old_tap_xmit(void)
+static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
{
while (tcp_payload_tap_used) {
- tap_free_old_xmit(tcp_payload_tap_used);
+ tap_free_old_xmit(c, tcp_payload_tap_used);
tcp_payload_tap_used = 0;
tcp_payload_sock_used = 0;
@@ -162,7 +162,7 @@ void tcp_payload_flush(const struct ctx *c)
tcp_payload_sock_used - m);
}
tcp_payload_tap_used += m;
- tcp_buf_free_old_tap_xmit();
+ tcp_buf_free_old_tap_xmit(c);
}
/**
--
2.50.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
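The used_idx() helper in this patch follows the usual virtio
event-suppression sequence: re-enable call notifications, drain the eventfd,
re-check the used index to close the race with the kernel, and only then
block in poll(2). A condensed sketch of that ordering (stand-in names,
barriers and error handling elided; an illustration, not the passt
implementation):

	#include <endian.h>
	#include <poll.h>
	#include <stdint.h>
	#include <unistd.h>
	#include <linux/virtio_ring.h>

	/* Wait until the device publishes a used index different from
	 * last_seen. 'avail', 'used' and 'call_fd' stand in for the real
	 * vring pointers and call eventfd. */
	static uint16_t wait_used(struct vring_avail *avail,
				  const struct vring_used *used,
				  uint16_t last_seen, int call_fd)
	{
		struct pollfd fds = { .fd = call_fd, .events = POLLIN };
		uint64_t evt;

		/* 1) ask the kernel to signal call_fd on new used entries */
		avail->flags &= ~htole16(VRING_AVAIL_F_NO_INTERRUPT);
		/* 2) drain a stale eventfd tick, if any */
		(void)read(call_fd, &evt, sizeof(evt));
		/* 3) re-check: used->idx may have moved before step 1 landed */
		if (le16toh(used->idx) == last_seen)
			(void)poll(&fds, 1, -1);	/* 4) block for the kick */
		/* 5) back to busy mode: suppress notifications again */
		avail->flags |= htole16(VRING_AVAIL_F_NO_INTERRUPT);
		return le16toh(used->idx);
	}

Step 3 closes the classic lost-wakeup window: the kernel may have advanced
used->idx before notifications were re-enabled, in which case no kick would
ever arrive for those entries.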
* [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (9 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 10/11] tap: add poll(2) to used_idx Eugenio Pérez
@ 2025-07-09 17:47 ` Eugenio Pérez
2025-07-10 9:46 ` [RFC v2 00/11] Add vhost-net kernel support Eugenio Perez Martin
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Pérez @ 2025-07-09 17:47 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
Now both tcp_sock and tap use the circular buffer as intended.
Very lightly tested, especially the paths that check for a full or
almost-full ring before producing, like
tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM.
Processing the tx buffers in a circular buffer makes namespace rx go
from ~11.5Gbit/s to ~17.26Gbit/s.
TODO: Increase the tx queue length, as we spend a lot of descriptors in
each request. Ideally, tx size should be at least
bufs_per_frame*TCP_FRAMES_MEM, but we may get more performance with
even bigger queues.
TODO: Sometimes we call tcp_buf_free_old_tap_xmit twice: once to free at
least N used tx buffers, and once more in tcp_payload_flush. Maybe we
can optimize it.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
tcp_buf.c | 130 ++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 106 insertions(+), 24 deletions(-)
diff --git a/tcp_buf.c b/tcp_buf.c
index f74d22d..326af79 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -53,13 +53,66 @@ static_assert(MSS6 <= sizeof(tcp_payload[0].data), "MSS6 is greater than 65516")
/* References tracking the owner connection of frames in the tap outqueue */
static struct tcp_tap_conn *tcp_frame_conns[TCP_FRAMES_MEM];
-static unsigned int tcp_payload_sock_used, tcp_payload_tap_used;
+
+/*
+ * sock_head: Head of buffers available for writing. tcp_data_to_tap moves it
+ * forward, but errors queueing to vhost can move it backwards to tap_head
+ * again.
+ *
+ * tap_head: Head of buffers that have been sent to vhost. flush moves this
+ * forward.
+ *
+ * tail: Chasing index. Increments when vhost uses buffers.
+ *
+ * _used: Independent variables to tell between full and empty.
+ */
+static unsigned int tcp_payload_sock_head, tcp_payload_tap_head, tcp_payload_tail, tcp_payload_sock_used, tcp_payload_tap_used;
+#define IS_POW2(y) (((y) > 0) && !((y) & ((y) - 1)))
+static_assert(ARRAY_SIZE(tcp_payload) == TCP_FRAMES_MEM, "TCP_FRAMES_MEM is not the size of tcp_payload anymore");
+static_assert(IS_POW2(TCP_FRAMES_MEM), "TCP_FRAMES_MEM must be a power of two");
+
+static size_t tcp_payload_cnt_to_end(size_t head, size_t tail)
+{
+ assert(head != tail);
+ size_t end = ARRAY_SIZE(tcp_payload) - tail;
+ size_t n = (head + end) % ARRAY_SIZE(tcp_payload);
+
+ return MIN(n, end);
+}
+
+/* Count the number of items that have been written from sock to the
+ * circular buffer and can be sent to tap.
+ */
+static size_t tcp_payload_tap_cnt(void)
+{
+ return tcp_payload_sock_used - tcp_payload_tap_used;
+}
static void tcp_payload_sock_produce(size_t n)
{
+ tcp_payload_sock_head = (tcp_payload_sock_head + n) % ARRAY_SIZE(tcp_payload);
tcp_payload_sock_used += n;
}
+/* Count the number of consecutive items that have been written from sock to the
+ * circular buffer and can be sent to tap without having to wrap back to the
+ * beginning of the buffer.
+ */
+static size_t tcp_payload_tap_cnt_to_end(void)
+{
+ if (tcp_payload_sock_head == tcp_payload_tap_head) {
+ /* empty? */
+ if (tcp_payload_sock_used - tcp_payload_tap_used == 0)
+ return 0;
+
+ /* full */
+ return ARRAY_SIZE(tcp_payload) - tcp_payload_tap_head;
+ }
+
+ return tcp_payload_cnt_to_end(tcp_payload_sock_head,
+ tcp_payload_tap_head);
+}
+
static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS];
/**
@@ -137,14 +190,13 @@ static void tcp_revert_seq(const struct ctx *c, struct tcp_tap_conn **conns,
}
}
-static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
+static void tcp_buf_free_old_tap_xmit(const struct ctx *c, size_t target)
{
- while (tcp_payload_tap_used) {
- tap_free_old_xmit(c, tcp_payload_tap_used);
+ size_t n = tap_free_old_xmit(c, target);
- tcp_payload_tap_used = 0;
- tcp_payload_sock_used = 0;
- }
+ tcp_payload_tail = (tcp_payload_tail + n) & (ARRAY_SIZE(tcp_payload) - 1);
+ tcp_payload_tap_used -= n;
+ tcp_payload_sock_used -= n;
}
/**
@@ -153,16 +205,33 @@ static void tcp_buf_free_old_tap_xmit(const struct ctx *c)
*/
void tcp_payload_flush(const struct ctx *c)
{
- size_t m;
+ size_t m, n = tcp_payload_tap_cnt_to_end();
+ struct iovec *head = &tcp_l2_iov[tcp_payload_tap_head][0];
- m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
- tcp_payload_sock_used, true);
- if (m != tcp_payload_sock_used) {
- tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
- tcp_payload_sock_used - m);
- }
+ tcp_buf_free_old_tap_xmit(c, (size_t)-1);
+ m = tap_send_frames(c, head, TCP_NUM_IOVS, n, true);
tcp_payload_tap_used += m;
- tcp_buf_free_old_tap_xmit(c);
+ tcp_payload_tap_head = (tcp_payload_tap_head + m) %
+ ARRAY_SIZE(tcp_payload);
+
+ if (m != n) {
+ n = tcp_payload_tap_cnt_to_end();
+
+ tcp_revert_seq(c, &tcp_frame_conns[tcp_payload_tap_head],
+ &tcp_l2_iov[tcp_payload_tap_head], n);
+ /*
+ * circular buffer wrap case.
+ * TODO: Maybe it's better to adapt tcp_revert_seq.
+ */
+ tcp_revert_seq(c, &tcp_frame_conns[0], &tcp_l2_iov[0],
+ tcp_payload_tap_cnt() - n);
+
+ tcp_payload_sock_head = tcp_payload_tap_head;
+ tcp_payload_sock_used = tcp_payload_tap_used;
+ } else if (tcp_payload_tap_cnt_to_end()) {
+ /* circular buffer wrap case */
+ tcp_payload_flush(c);
+ }
}
/**
@@ -209,14 +278,15 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
size_t optlen;
size_t l4len;
uint32_t seq;
+ unsigned int i = tcp_payload_sock_head;
int ret;
- iov = tcp_l2_iov[tcp_payload_sock_used];
+ iov = tcp_l2_iov[i];
if (CONN_V4(conn)) {
- iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
+ iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[i]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else {
- iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
+ iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[i]);
iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
}
@@ -228,13 +298,15 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
return ret;
tcp_payload_sock_produce(1);
+ i = tcp_payload_sock_head;
l4len = optlen + sizeof(struct tcphdr);
iov[TCP_IOV_PAYLOAD].iov_len = l4len;
tcp_l2_buf_fill_headers(conn, iov, NULL, seq, false);
if (flags & DUP_ACK) {
- struct iovec *dup_iov = tcp_l2_iov[tcp_payload_sock_used];
+ struct iovec *dup_iov = tcp_l2_iov[i];
tcp_payload_sock_produce(1);
+ i = tcp_payload_sock_head;
memcpy(dup_iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_TAP].iov_base,
iov[TCP_IOV_TAP].iov_len);
@@ -246,7 +318,10 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
}
if (tcp_payload_sock_used > TCP_FRAMES_MEM - 2) {
+ tcp_buf_free_old_tap_xmit(c, 2);
tcp_payload_flush(c);
+ /* TODO how to fix this? The original code didn't check for success either */
+ assert(tcp_payload_sock_used <= TCP_FRAMES_MEM - 2);
}
return 0;
@@ -269,16 +344,17 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
struct iovec *iov;
conn->seq_to_tap = seq + dlen;
- tcp_frame_conns[tcp_payload_sock_used] = conn;
- iov = tcp_l2_iov[tcp_payload_sock_used];
+ tcp_frame_conns[tcp_payload_sock_head] = conn;
+ iov = tcp_l2_iov[tcp_payload_sock_head];
if (CONN_V4(conn)) {
if (no_csum) {
- struct iovec *iov_prev = tcp_l2_iov[tcp_payload_sock_used - 1];
+ unsigned prev = (tcp_payload_sock_head - 1) % TCP_FRAMES_MEM;
+ struct iovec *iov_prev = tcp_l2_iov[prev];
struct iphdr *iph = iov_prev[TCP_IOV_IP].iov_base;
check = &iph->check;
}
- iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_used]);
+ iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_sock_head]);
iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
} else if (CONN_V6(conn)) {
iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_sock_used]);
@@ -294,8 +370,11 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
tcp_l2_buf_fill_headers(conn, iov, check, seq, false);
tcp_payload_sock_produce(1);
if (tcp_payload_sock_used > TCP_FRAMES_MEM - 1) {
+ tcp_buf_free_old_tap_xmit(c, 1);
tcp_payload_flush(c);
+ assert(tcp_payload_sock_used <= TCP_FRAMES_MEM - 1);
}
+
}
/**
@@ -362,11 +441,14 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
}
if (tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM) {
+ tcp_buf_free_old_tap_xmit(c, fill_bufs);
tcp_payload_flush(c);
+ /* TODO how to report this to upper layers? */
+ assert(tcp_payload_sock_used + fill_bufs <= TCP_FRAMES_MEM);
}
for (i = 0, iov = iov_sock + 1; i < fill_bufs; i++, iov++) {
- iov->iov_base = &tcp_payload[tcp_payload_sock_used + i].data;
+ iov->iov_base = &tcp_payload[(tcp_payload_sock_head + i) % TCP_FRAMES_MEM].data;
iov->iov_len = mss;
}
if (iov_rem)
--
2.50.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
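The tcp_payload_cnt_to_end() helper in this patch is the standard trick for
splitting a wrapped ring region into at most two linear spans, so that
tap_send_frames() can be handed contiguous runs of iovecs. A self-contained
sketch with worked examples (SZ and the names are assumptions for
illustration; the real helper asserts head != tail and lets the _used
counters tell a full ring from an empty one):

	#include <stdio.h>
	#include <stddef.h>

	#define SZ 8	/* ring size, must be a power of two */
	#define MIN(a, b) ((a) < (b) ? (a) : (b))

	/* How many items starting at 'tail' are contiguous (no wrap) in a
	 * ring of SZ slots, with 'head' one past the last used slot.
	 * head == tail reads as empty in this simplified version. */
	static size_t cnt_to_end(size_t head, size_t tail)
	{
		size_t end = SZ - tail;		/* slots before the wrap point */
		size_t n = (head + end) % SZ;	/* total used, folded to [0, SZ) */

		return MIN(n, end);
	}

	int main(void)
	{
		/* items in slots 6,7,0,1: only 6,7 are contiguous from tail */
		printf("%zu\n", cnt_to_end(2, 6));	/* prints 2 */
		/* items in slots 2..5: all four are contiguous */
		printf("%zu\n", cnt_to_end(6, 2));	/* prints 4 */
		return 0;
	}

On a wrap, the caller sends the span up to the end of the array first and
the remainder from slot 0 on the next pass, which is what the recursive tail
of tcp_payload_flush() does.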
* Re: [RFC v2 00/11] Add vhost-net kernel support
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
` (10 preceding siblings ...)
2025-07-09 17:47 ` [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer Eugenio Pérez
@ 2025-07-10 9:46 ` Eugenio Perez Martin
11 siblings, 0 replies; 13+ messages in thread
From: Eugenio Perez Martin @ 2025-07-10 9:46 UTC (permalink / raw)
To: passt-dev; +Cc: jasowang
On Wed, Jul 9, 2025 at 7:48 PM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> vhost-net is a kernel device that allows to read packets from a tap
> device using virtio queues instead of regular read(2) and write(2).
> This enables a more eficient packet processing, as the memory can
> be written directly by the kernel to the userspace and back, instead
> of wasting bandwith on copies, and it enables to batch many packets
> in a single notification (through eventfds) both tx and rx.
>
> Namespace tx performance improves from ~26.3Gbit/s to ~36.9Gbit/s.
>
> Namespace rx performance improves from ~16BGbit/s to ~17.26Gbit/s.
>
For reference, iperf without namespaces and veth both reach
~47.8Gbit/s in both directions.
> RFC: At this moment only these are supported:
> * Receive l2 packets from the vhost kernel to pasta
> * Send l4 tcp socket received data through vhost-kernel to namespace.
>
> TODO: Add vhost zerocopy in the tests, and compare with veth.
> TODO: Implement at least UDP tx. Or maybe we want UDP to be write(2) because
> of latency?
> TODO: Check style for variable declarations in for loops and use of curly
> brackets as long as they wrap more than a line.
> TODO: kerneldoc style function header comments
>
> --
> v2: Add TCP tx, and integrated some comments from the previous series. Please
> check each patch message for details.
>
> Eugenio Pérez (11):
> tap: implement vhost_call_cb
> tap: add die() on vhost error
> Replace tx tap hdr with virtio_nethdr_mrg_rxbuf
> tcp: export memory regions to vhost
> virtio: Fill .next in tx queue
> tap: move static iov_sock to tcp_buf_data_from_sock
> tap: support tx through vhost
> tap: add tap_free_old_xmit
> tcp: start conversion to circular buffer
> Add poll(2) to used_idx
> tcp_buf: adding TCP tx circular buffer
>
> arp.c | 2 +-
> epoll_type.h | 4 +
> passt.c | 12 +-
> passt.h | 11 +-
> tap.c | 489 +++++++++++++++++++++++++++++++++++++++++++++++++--
> tap.h | 13 +-
> tcp.c | 2 +-
> tcp_buf.c | 179 +++++++++++++++----
> tcp_buf.h | 19 ++
> udp.c | 2 +-
> 10 files changed, 675 insertions(+), 58 deletions(-)
>
> --
> 2.50.0
>
^ permalink raw reply [flat|nested] 13+ messages in thread
Thread overview: 13+ messages
-- links below jump to the message on this page --
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 02/11] tap: add die() on vhost error Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 04/11] tcp: export memory regions to vhost Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 05/11] virtio: Fill .next in tx queue Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 07/11] tap: support tx through vhost Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 08/11] tap: add tap_free_old_xmit Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 09/11] tcp: start conversion to circular buffer Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 10/11] tap: add poll(2) to used_idx Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer Eugenio Pérez
2025-07-10 9:46 ` [RFC v2 00/11] Add vhost-net kernel support Eugenio Perez Martin
Code repositories for project(s) associated with this public inbox
https://passt.top/passt