public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading
@ 2025-10-17 10:31 Laurent Vivier
  2025-10-17 10:31 ` [PATCH v4 1/7] util: Simplify epoll_del() interface to take epollfd directly Laurent Vivier
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 10:31 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

This series refactors how epoll file descriptors are managed throughout
the codebase in preparation for introducing multithreading support.
Currently, passt uses a single global epollfd accessed through the
context structure. With multithreading, each thread will need its own
epollfd managing its subset of flows.

The key changes are:

1. Centralize epoll management by extracting helper functions into a new
   epoll_ctl.c/h module and moving union epoll_ref from passt.h to its
   more logical location in epoll_ctl.h.

2. Simplify epoll_del() to take the epollfd directly rather than
   extracting it from the context structure, reducing coupling between
   epoll operations and the global context.

3. Move epoll registration out of sock_l4_sa() into protocol-specific
   code, giving callers explicit control over which epoll instance
   manages each socket.

4. Replace the boolean in_epoll flag in TCP connections with a threadnb
   (thread number) field in flow_common. This serves dual purposes:
   tracking registration status (FLOW_THREADNB_INVALID = not registered)
   and identifying the owning thread. A thread-to-epollfd mapping allows
   retrieving the actual epoll file descriptor. The threadnb field is 8
   bits, limiting values to 0-254 (255 = FLOW_THREADNB_INVALID).

5. Apply this pattern consistently across all protocol handlers (TCP,
   ICMP, UDP), storing the managing thread number in each flow's common
   structure.

These changes make thread ownership explicit in the flow tracking system,
enabling flows to be managed by different threads (and their epoll
instances), a prerequisite for per-thread epollfd design in upcoming
multithreading work.

Changes since v1:
- New patch: "epoll_ctl: Extract epoll operations" - centralizes epoll
  helpers into a dedicated module and relocates union epoll_ref from
  passt.h to epoll_ctl.h
- Changed epollfd type in flow_common from int to 8-bit bitfield to avoid
  exceeding cacheline size threshold
- Added flow_epollfd_valid() helper to check epoll registration status
- Added flow_set_epollfd() helper to set the epoll instance for a flow

Changes since v2:
- Renamed field in flow_common from epollfd to threadnb (thread number)
  to better reflect that flows are now associated with threads rather
  than directly with epoll file descriptors
- Introduced thread-to-epollfd mapping via threadnb_to_epollfd[] array
  in flow.c, providing an indirection layer between threads and their
  epoll instances
- Renamed/refactored helper functions:
  - flow_set_epollfd() -> flow_epollfd_set() - now takes thread number
    and epollfd parameters
  - Added flow_epollfd_get() - retrieves the epoll fd for a flow's thread
  - flow_epollfd_valid() - now checks threadnb instead of epollfd
- Updated constants: replaced EPOLLFD_* with FLOW_THREADNB_* equivalents
  (e.g., EPOLLFD_INVALID -> FLOW_THREADNB_INVALID)
- Applied thread-based pattern consistently across TCP, ICMP, and UDP

Changes since v3:
- Change warn() by err() in epoll_add()
- Renamed flow_epollfd_valid() to flow_in_epoll(), flow_epollfd_get() to
  flow_epollfd().
- Added flow_thread_register() to register an epollfd to a thread number,
  and called it from main() to register c->epollfd for thread 0 (main loop)
- Replaced flow_epollfd_set() by flow_thread_set() to set the thread
  number of a flow.
- Added new patch:
  "passt: Move main event loop processing into passt_worker()"

Laurent Vivier (7):
  util: Simplify epoll_del() interface to take epollfd directly
  epoll_ctl: Extract epoll operations
  util: Move epoll registration out of sock_l4_sa()
  tcp, flow: Replace per-connection in_epoll flag with threadnb in
    flow_common
  icmp: Use thread-based epoll management for ICMP flows
  udp: Use thread-based epoll management for UDP flows
  passt: Move main event loop processing into passt_worker()

 Makefile     |  22 +++----
 epoll_ctl.c  |  45 ++++++++++++++
 epoll_ctl.h  |  51 ++++++++++++++++
 flow.c       |  63 +++++++++++++++++---
 flow.h       |  14 ++++-
 icmp.c       |  23 +++++---
 passt.c      | 163 ++++++++++++++++++++++++++++-----------------------
 passt.h      |  34 -----------
 pasta.c      |   7 +--
 pif.c        |  32 +++++++---
 repair.c     |  14 ++---
 tap.c        |  15 ++---
 tcp.c        |  41 +++++++------
 tcp_conn.h   |   8 +--
 tcp_splice.c |  26 ++++----
 udp.c        |   2 +-
 udp_flow.c   |  23 ++++++--
 util.c       |  28 +--------
 util.h       |   6 +-
 vhost_user.c |  14 ++---
 vu_common.c  |   2 +-
 21 files changed, 387 insertions(+), 246 deletions(-)
 create mode 100644 epoll_ctl.c
 create mode 100644 epoll_ctl.h

-- 
2.51.0



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 1/7] util: Simplify epoll_del() interface to take epollfd directly
  2025-10-17 10:31 [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading Laurent Vivier
@ 2025-10-17 10:31 ` Laurent Vivier
  2025-10-17 10:31 ` [PATCH v4 2/7] epoll_ctl: Extract epoll operations Laurent Vivier
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 10:31 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

Change epoll_del() to accept the epoll file descriptor directly instead
of the full context structure. This simplifies the interface and aligns
with the threading refactoring by reducing dependency on the context
structure for basic epoll operations as we will manage an epollfd per
thread.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 icmp.c       | 2 +-
 tap.c        | 2 +-
 tcp.c        | 6 +++---
 tcp_splice.c | 4 ++--
 udp_flow.c   | 4 ++--
 util.c       | 6 +++---
 util.h       | 2 +-
 vhost_user.c | 6 +++---
 8 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/icmp.c b/icmp.c
index 6dffafb0bf54..bd3108a21675 100644
--- a/icmp.c
+++ b/icmp.c
@@ -151,7 +151,7 @@ unexpected:
 static void icmp_ping_close(const struct ctx *c,
 			    const struct icmp_ping_flow *pingf)
 {
-	epoll_del(c, pingf->sock);
+	epoll_del(c->epollfd, pingf->sock);
 	close(pingf->sock);
 	flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
 }
diff --git a/tap.c b/tap.c
index f3d1f66041f8..9812f120d426 100644
--- a/tap.c
+++ b/tap.c
@@ -1142,7 +1142,7 @@ void tap_sock_reset(struct ctx *c)
 	}
 
 	/* Close the connected socket, wait for a new connection */
-	epoll_del(c, c->fd_tap);
+	epoll_del(c->epollfd, c->fd_tap);
 	close(c->fd_tap);
 	c->fd_tap = -1;
 	if (c->mode == MODE_VU)
diff --git a/tcp.c b/tcp.c
index 0f9e9b3fdc03..745353f782f5 100644
--- a/tcp.c
+++ b/tcp.c
@@ -511,9 +511,9 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 
 	if (conn->events == CLOSED) {
 		if (conn->in_epoll)
-			epoll_del(c, conn->sock);
+			epoll_del(c->epollfd, conn->sock);
 		if (conn->timer != -1)
-			epoll_del(c, conn->timer);
+			epoll_del(c->epollfd, conn->timer);
 		return 0;
 	}
 
@@ -3476,7 +3476,7 @@ int tcp_flow_migrate_source_ext(const struct ctx *c,
 	if (c->migrate_no_linger)
 		close(s);
 	else
-		epoll_del(c, s);
+		epoll_del(c->epollfd, s);
 
 	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
 	 * based on the end of the queues.
diff --git a/tcp_splice.c b/tcp_splice.c
index 26cb63064583..666ee62b738f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -204,8 +204,8 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 	}
 
 	if (flag == CLOSING) {
-		epoll_del(c, conn->s[0]);
-		epoll_del(c, conn->s[1]);
+		epoll_del(c->epollfd, conn->s[0]);
+		epoll_del(c->epollfd, conn->s[1]);
 	}
 }
 
diff --git a/udp_flow.c b/udp_flow.c
index cef3fb588bbe..84973f807167 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -51,7 +51,7 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 	flow_foreach_sidei(sidei) {
 		flow_hash_remove(c, FLOW_SIDX(uflow, sidei));
 		if (uflow->s[sidei] >= 0) {
-			epoll_del(c, uflow->s[sidei]);
+			epoll_del(c->epollfd, uflow->s[sidei]);
 			close(uflow->s[sidei]);
 			uflow->s[sidei] = -1;
 		}
@@ -88,7 +88,7 @@ static int udp_flow_sock(const struct ctx *c,
 	if (flowside_connect(c, s, pif, side) < 0) {
 		int rc = -errno;
 
-		epoll_del(c, s);
+		epoll_del(c->epollfd, s);
 		close(s);
 
 		flow_dbg_perror(uflow, "Couldn't connect flow socket");
diff --git a/util.c b/util.c
index c492f904b3fc..1067486be414 100644
--- a/util.c
+++ b/util.c
@@ -996,12 +996,12 @@ void raw_random(void *buf, size_t buflen)
 
 /**
  * epoll_del() - Remove a file descriptor from our passt epoll
- * @c:		Execution context
+ * @epollfd:	epoll file descriptor to remove from
  * @fd:		File descriptor to remove
  */
-void epoll_del(const struct ctx *c, int fd)
+void epoll_del(int epollfd, int fd)
 {
-	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, fd, NULL);
+	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
 
 }
 
diff --git a/util.h b/util.h
index 22eaac56e719..c61cbef357aa 100644
--- a/util.h
+++ b/util.h
@@ -300,7 +300,7 @@ static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
 #define FPRINTF(f, ...)	(void)fprintf(f, __VA_ARGS__)
 
 void raw_random(void *buf, size_t buflen);
-void epoll_del(const struct ctx *c, int fd);
+void epoll_del(int epollfd, int fd);
 
 /*
  * Starting from glibc 2.40.9000 and commit 25a5eb4010df ("string: strerror,
diff --git a/vhost_user.c b/vhost_user.c
index 223332d5018e..f8324c59cc6c 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -733,7 +733,7 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
 		vdev->vq[idx].call_fd = -1;
 	}
 	if (vdev->vq[idx].kick_fd != -1) {
-		epoll_del(vdev->context, vdev->vq[idx].kick_fd);
+		epoll_del(vdev->context->epollfd, vdev->vq[idx].kick_fd);
 		close(vdev->vq[idx].kick_fd);
 		vdev->vq[idx].kick_fd = -1;
 	}
@@ -801,7 +801,7 @@ static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
 	vu_check_queue_msg_file(vmsg);
 
 	if (vdev->vq[idx].kick_fd != -1) {
-		epoll_del(vdev->context, vdev->vq[idx].kick_fd);
+		epoll_del(vdev->context->epollfd, vdev->vq[idx].kick_fd);
 		close(vdev->vq[idx].kick_fd);
 		vdev->vq[idx].kick_fd = -1;
 	}
@@ -1093,7 +1093,7 @@ void vu_cleanup(struct vu_dev *vdev)
 			vq->err_fd = -1;
 		}
 		if (vq->kick_fd != -1) {
-			epoll_del(vdev->context, vq->kick_fd);
+			epoll_del(vdev->context->epollfd, vq->kick_fd);
 			close(vq->kick_fd);
 			vq->kick_fd = -1;
 		}
-- 
2.51.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 2/7] epoll_ctl: Extract epoll operations
  2025-10-17 10:31 [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading Laurent Vivier
  2025-10-17 10:31 ` [PATCH v4 1/7] util: Simplify epoll_del() interface to take epollfd directly Laurent Vivier
@ 2025-10-17 10:31 ` Laurent Vivier
  2025-10-17 11:48   ` Stefano Brivio
  2025-10-20  1:20   ` David Gibson
  2025-10-17 10:31 ` [PATCH v4 3/7] util: Move epoll registration out of sock_l4_sa() Laurent Vivier
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 10:31 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Centralize epoll_add() and epoll_del() helper functions into new
epoll_ctl.c/h files.

This also moves the union epoll_ref definition from passt.h to
epoll_ctl.h where it's more logically placed.

The new epoll_add() helper simplifies adding file descriptors to epoll
by taking an epoll_ref and events, handling error reporting
consistently across all call sites.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile     | 22 +++++++++++-----------
 epoll_ctl.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
 epoll_ctl.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 icmp.c       |  4 +---
 passt.c      |  2 +-
 passt.h      | 34 ----------------------------------
 pasta.c      |  7 +++----
 repair.c     | 14 +++++---------
 tap.c        | 13 ++++---------
 tcp.c        |  2 +-
 tcp_splice.c |  2 +-
 udp.c        |  2 +-
 udp_flow.c   |  1 +
 util.c       | 22 +++-------------------
 util.h       |  4 +++-
 vhost_user.c |  8 ++------
 vu_common.c  |  2 +-
 17 files changed, 134 insertions(+), 101 deletions(-)
 create mode 100644 epoll_ctl.c
 create mode 100644 epoll_ctl.h

diff --git a/Makefile b/Makefile
index 3328f8324140..91e037b8fd3c 100644
--- a/Makefile
+++ b/Makefile
@@ -37,23 +37,23 @@ FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
 FLAGS += -DVERSION=\"$(VERSION)\"
 FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
-PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
-	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
-	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
-	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
+PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c epoll_ctl.c \
+	flow.c fwd.c icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c \
+	log.c mld.c ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c \
+	pif.c repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c \
+	udp_flow.c udp_vu.c util.c vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
 
 MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 
-PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
-	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
-	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
-	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
-	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
-	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
+PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h epoll_ctl.h \
+	flow.h fwd.h flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h \
+	isolation.h lineread.h log.h migrate.h ndp.h netlink.h packet.h \
+	passt.h pasta.h pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h \
+	tcp_conn.h tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h \
+	udp_internal.h udp_vu.h util.h vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/epoll_ctl.c b/epoll_ctl.c
new file mode 100644
index 000000000000..7a520560aeb9
--- /dev/null
+++ b/epoll_ctl.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* epoll_ctl.c - epoll manipulation helpers
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#include <errno.h>
+
+#include "epoll_ctl.h"
+
+/**
+ * epoll_add() - Add a file descriptor to an epollfd
+ * @epollfd:	epoll file descriptor to add to
+ * @events:	epoll events
+ * @ref:	epoll reference for the file descriptor (includes fd and metadata)
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+int epoll_add(int epollfd, uint32_t events, union epoll_ref *ref)
+{
+	struct epoll_event ev;
+	int ret;
+
+	ev.events = events;
+	ev.data.u64 = ref->u64;
+
+	ret = epoll_ctl(epollfd, EPOLL_CTL_ADD, ref->fd, &ev);
+	if (ret == -1) {
+		ret = -errno;
+		err("Failed to add fd to epoll: %s", strerror_(-ret));
+	}
+
+	return ret;
+}
+
+/**
+ * epoll_del() - Remove a file descriptor from an epollfd
+ * @epollfd:	epoll file descriptor to remove from
+ * @fd:		File descriptor to remove
+ */
+void epoll_del(int epollfd, int fd)
+{
+	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
+}
diff --git a/epoll_ctl.h b/epoll_ctl.h
new file mode 100644
index 000000000000..cf92b0f63f26
--- /dev/null
+++ b/epoll_ctl.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef EPOLL_CTL_H
+#define EPOLL_CTL_H
+
+#include <sys/epoll.h>
+
+#include "util.h"
+#include "passt.h"
+#include "epoll_type.h"
+#include "flow.h"
+#include "tcp.h"
+#include "udp.h"
+
+/**
+ * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
+ * @type:	Type of fd (tells us what to do with events)
+ * @fd:		File descriptor number (implies < 2^24 total descriptors)
+ * @flow:	Index of the flow this fd is linked to
+ * @tcp_listen:	TCP-specific reference part for listening sockets
+ * @udp:	UDP-specific reference part
+ * @data:	Data handled by protocol handlers
+ * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
+ * @queue:	vhost-user queue index for this fd
+ * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
+ */
+union epoll_ref {
+	struct {
+		enum epoll_type type:8;
+		int32_t		  fd:FD_REF_BITS;
+		union {
+			uint32_t flow;
+			flow_sidx_t flowside;
+			union tcp_listen_epoll_ref tcp_listen;
+			union udp_listen_epoll_ref udp;
+			uint32_t data;
+			int nsdir_fd;
+			int queue;
+		};
+	};
+	uint64_t u64;
+};
+static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
+	      "epoll_ref must have same size as epoll_data");
+
+int epoll_add(int epollfd, uint32_t events, union epoll_ref *ref);
+void epoll_del(int epollfd, int fd);
+#endif /* EPOLL_CTL_H */
diff --git a/icmp.c b/icmp.c
index bd3108a21675..c26561da80bf 100644
--- a/icmp.c
+++ b/icmp.c
@@ -15,7 +15,6 @@
 #include <errno.h>
 #include <net/ethernet.h>
 #include <net/if.h>
-#include <netinet/in.h>
 #include <netinet/ip.h>
 #include <netinet/ip_icmp.h>
 #include <stdio.h>
@@ -23,10 +22,8 @@
 #include <stdint.h>
 #include <stddef.h>
 #include <string.h>
-#include <sys/epoll.h>
 #include <sys/types.h>
 #include <sys/socket.h>
-#include <unistd.h>
 #include <time.h>
 
 #include <linux/icmpv6.h>
@@ -41,6 +38,7 @@
 #include "inany.h"
 #include "icmp.h"
 #include "flow_table.h"
+#include "epoll_ctl.h"
 
 #define ICMP_ECHO_TIMEOUT	60 /* s, timeout for ICMP socket activity */
 #define ICMP_NUM_IDS		(1U << 16)
diff --git a/passt.c b/passt.c
index bdb7b6935f0c..af928111786b 100644
--- a/passt.c
+++ b/passt.c
@@ -19,7 +19,6 @@
  * created in a separate network namespace).
  */
 
-#include <sys/epoll.h>
 #include <fcntl.h>
 #include <sys/mman.h>
 #include <sys/resource.h>
@@ -53,6 +52,7 @@
 #include "vu_common.h"
 #include "migrate.h"
 #include "repair.h"
+#include "epoll_ctl.h"
 
 #define NUM_EPOLL_EVENTS	8
 
diff --git a/passt.h b/passt.h
index 0075eb4b3b16..befe56bb167b 100644
--- a/passt.h
+++ b/passt.h
@@ -35,40 +35,6 @@ union epoll_ref;
 #define MAC_OUR_LAA	\
 	((uint8_t [ETH_ALEN]){0x9a, 0x55, 0x9a, 0x55, 0x9a, 0x55})
 
-/**
- * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
- * @type:	Type of fd (tells us what to do with events)
- * @fd:		File descriptor number (implies < 2^24 total descriptors)
- * @flow:	Index of the flow this fd is linked to
- * @tcp_listen:	TCP-specific reference part for listening sockets
- * @udp:	UDP-specific reference part
- * @icmp:	ICMP-specific reference part
- * @data:	Data handled by protocol handlers
- * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
- * @queue:	vhost-user queue index for this fd
- * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
- */
-union epoll_ref {
-	struct {
-		enum epoll_type type:8;
-#define FD_REF_BITS		24
-#define FD_REF_MAX		((int)MAX_FROM_BITS(FD_REF_BITS))
-		int32_t		fd:FD_REF_BITS;
-		union {
-			uint32_t flow;
-			flow_sidx_t flowside;
-			union tcp_listen_epoll_ref tcp_listen;
-			union udp_listen_epoll_ref udp;
-			uint32_t data;
-			int nsdir_fd;
-			int queue;
-		};
-	};
-	uint64_t u64;
-};
-static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
-	      "epoll_ref must have same size as epoll_data");
-
 /* Large enough for ~128 maximum size frames */
 #define PKT_BUF_BYTES		(8UL << 20)
 
diff --git a/pasta.c b/pasta.c
index 687406b6e736..e905f6d33b95 100644
--- a/pasta.c
+++ b/pasta.c
@@ -27,7 +27,6 @@
 #include <stdint.h>
 #include <unistd.h>
 #include <syslog.h>
-#include <sys/epoll.h>
 #include <sys/inotify.h>
 #include <sys/mount.h>
 #include <sys/timerfd.h>
@@ -49,6 +48,7 @@
 #include "isolation.h"
 #include "netlink.h"
 #include "log.h"
+#include "epoll_ctl.h"
 
 #define HOSTNAME_PREFIX		"pasta-"
 
@@ -444,7 +444,6 @@ static int pasta_netns_quit_timer(void)
  */
 void pasta_netns_quit_init(const struct ctx *c)
 {
-	struct epoll_event ev = { .events = EPOLLIN };
 	int flags = O_NONBLOCK | O_CLOEXEC;
 	struct statfs s = { 0 };
 	bool try_inotify = true;
@@ -487,8 +486,8 @@ void pasta_netns_quit_init(const struct ctx *c)
 		die("netns monitor file number %i too big, exiting", fd);
 
 	ref.fd = fd;
-	ev.data.u64 = ref.u64;
-	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev);
+
+	epoll_add(c->epollfd, EPOLLIN, &ref);
 }
 
 /**
diff --git a/repair.c b/repair.c
index f6b1bf36479c..c8f4737fa62a 100644
--- a/repair.c
+++ b/repair.c
@@ -22,6 +22,7 @@
 #include "inany.h"
 #include "flow.h"
 #include "flow_table.h"
+#include "epoll_ctl.h"
 
 #include "repair.h"
 
@@ -47,7 +48,6 @@ static int repair_nfds;
 void repair_sock_init(const struct ctx *c)
 {
 	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
-	struct epoll_event ev = { 0 };
 
 	if (c->fd_repair_listen == -1)
 		return;
@@ -58,9 +58,7 @@ void repair_sock_init(const struct ctx *c)
 	}
 
 	ref.fd = c->fd_repair_listen;
-	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
-	ev.data.u64 = ref.u64;
-	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
+	if (epoll_add(c->epollfd, EPOLLIN | EPOLLHUP | EPOLLET, &ref))
 		err_perror("repair helper socket epoll_ctl(), won't migrate");
 }
 
@@ -74,7 +72,6 @@ void repair_sock_init(const struct ctx *c)
 int repair_listen_handler(struct ctx *c, uint32_t events)
 {
 	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
-	struct epoll_event ev = { 0 };
 	struct ucred ucred;
 	socklen_t len;
 	int rc;
@@ -112,10 +109,9 @@ int repair_listen_handler(struct ctx *c, uint32_t events)
 		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
 
 	ref.fd = c->fd_repair;
-	ev.events = EPOLLHUP | EPOLLET;
-	ev.data.u64 = ref.u64;
-	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
-		rc = errno;
+
+	rc = epoll_add(c->epollfd, EPOLLHUP | EPOLLET, &ref);
+	if (rc < 0) {
 		debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
 		close(c->fd_repair);
 		c->fd_repair = -1;
diff --git a/tap.c b/tap.c
index 9812f120d426..5b9403dce25b 100644
--- a/tap.c
+++ b/tap.c
@@ -26,7 +26,6 @@
 #include <netinet/in.h>
 #include <arpa/inet.h>
 #include <stdint.h>
-#include <sys/epoll.h>
 #include <sys/ioctl.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -61,6 +60,7 @@
 #include "log.h"
 #include "vhost_user.h"
 #include "vu_common.h"
+#include "epoll_ctl.h"
 
 /* Maximum allowed frame lengths (including L2 header) */
 
@@ -1327,14 +1327,12 @@ static void tap_backend_show_hints(struct ctx *c)
 static void tap_sock_unix_init(const struct ctx *c)
 {
 	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
-	struct epoll_event ev = { 0 };
 
 	listen(c->fd_tap_listen, 0);
 
 	ref.fd = c->fd_tap_listen;
-	ev.events = EPOLLIN | EPOLLET;
-	ev.data.u64 = ref.u64;
-	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
+
+	epoll_add(c->epollfd, EPOLLIN | EPOLLET, &ref);
 }
 
 /**
@@ -1343,7 +1341,6 @@ static void tap_sock_unix_init(const struct ctx *c)
  */
 static void tap_start_connection(const struct ctx *c)
 {
-	struct epoll_event ev = { 0 };
 	union epoll_ref ref = { 0 };
 
 	ref.fd = c->fd_tap;
@@ -1359,9 +1356,7 @@ static void tap_start_connection(const struct ctx *c)
 		break;
 	}
 
-	ev.events = EPOLLIN | EPOLLRDHUP;
-	ev.data.u64 = ref.u64;
-	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
+	epoll_add(c->epollfd, EPOLLIN | EPOLLRDHUP, &ref);
 
 	if (c->ifi4)
 		arp_send_init_req(c);
diff --git a/tcp.c b/tcp.c
index 745353f782f5..db9f17c0622f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -279,7 +279,6 @@
 #include <stdbool.h>
 #include <stddef.h>
 #include <string.h>
-#include <sys/epoll.h>
 #include <sys/ioctl.h>
 #include <sys/socket.h>
 #include <sys/timerfd.h>
@@ -309,6 +308,7 @@
 #include "tcp_internal.h"
 #include "tcp_buf.h"
 #include "tcp_vu.h"
+#include "epoll_ctl.h"
 
 /*
  * The size of TCP header (including options) is given by doff (Data Offset)
diff --git a/tcp_splice.c b/tcp_splice.c
index 666ee62b738f..6f21184bdc55 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -44,7 +44,6 @@
 #include <net/ethernet.h>
 #include <netinet/in.h>
 #include <netinet/tcp.h>
-#include <sys/epoll.h>
 #include <sys/types.h>
 #include <sys/socket.h>
 
@@ -56,6 +55,7 @@
 #include "siphash.h"
 #include "inany.h"
 #include "flow.h"
+#include "epoll_ctl.h"
 
 #include "flow_table.h"
 
diff --git a/udp.c b/udp.c
index 86585b7e0942..3812d5c2336f 100644
--- a/udp.c
+++ b/udp.c
@@ -94,7 +94,6 @@
 #include <stdint.h>
 #include <stddef.h>
 #include <string.h>
-#include <sys/epoll.h>
 #include <sys/types.h>
 #include <sys/socket.h>
 #include <sys/uio.h>
@@ -115,6 +114,7 @@
 #include "flow_table.h"
 #include "udp_internal.h"
 #include "udp_vu.h"
+#include "epoll_ctl.h"
 
 #define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
 
diff --git a/udp_flow.c b/udp_flow.c
index 84973f807167..d9c75f1bb1d8 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -15,6 +15,7 @@
 #include "passt.h"
 #include "flow_table.h"
 #include "udp_internal.h"
+#include "epoll_ctl.h"
 
 #define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
 
diff --git a/util.c b/util.c
index 1067486be414..b2490123590a 100644
--- a/util.c
+++ b/util.c
@@ -18,7 +18,6 @@
 #include <unistd.h>
 #include <arpa/inet.h>
 #include <net/ethernet.h>
-#include <sys/epoll.h>
 #include <sys/uio.h>
 #include <fcntl.h>
 #include <string.h>
@@ -35,6 +34,7 @@
 #include "packet.h"
 #include "log.h"
 #include "pcap.h"
+#include "epoll_ctl.h"
 #ifdef HAS_GETRANDOM
 #include <sys/random.h>
 #endif
@@ -58,7 +58,6 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	sa_family_t af = ((const struct sockaddr *)sa)->sa_family;
 	union epoll_ref ref = { .type = type, .data = data };
 	bool freebind = false;
-	struct epoll_event ev;
 	int fd, y = 1, ret;
 	uint8_t proto;
 	int socktype;
@@ -172,13 +171,9 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		return ret;
 	}
 
-	ev.events = EPOLLIN;
-	ev.data.u64 = ref.u64;
-	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
-		ret = -errno;
-		warn("L4 epoll_ctl: %s", strerror_(-ret));
+	ret = epoll_add(c->epollfd, EPOLLIN, &ref);
+	if (ret < 0)
 		return ret;
-	}
 
 	return fd;
 }
@@ -994,17 +989,6 @@ void raw_random(void *buf, size_t buflen)
 		die("Unexpected EOF on random data source");
 }
 
-/**
- * epoll_del() - Remove a file descriptor from our passt epoll
- * @epollfd:	epoll file descriptor to remove from
- * @fd:		File descriptor to remove
- */
-void epoll_del(int epollfd, int fd)
-{
-	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
-
-}
-
 /**
  * encode_domain_name() - Encode domain name according to RFC 1035, section 3.1
  * @buf:		Buffer to fill in with encoded domain name
diff --git a/util.h b/util.h
index c61cbef357aa..8e4b4c5c6032 100644
--- a/util.h
+++ b/util.h
@@ -193,6 +193,9 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 #define SNDBUF_BIG		(4ULL * 1024 * 1024)
 #define SNDBUF_SMALL		(128ULL * 1024)
 
+#define FD_REF_BITS		24
+#define FD_REF_MAX		((int)MAX_FROM_BITS(FD_REF_BITS))
+
 #include <net/if.h>
 #include <limits.h>
 #include <stdint.h>
@@ -300,7 +303,6 @@ static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
 #define FPRINTF(f, ...)	(void)fprintf(f, __VA_ARGS__)
 
 void raw_random(void *buf, size_t buflen);
-void epoll_del(int epollfd, int fd);
 
 /*
  * Starting from glibc 2.40.9000 and commit 25a5eb4010df ("string: strerror,
diff --git a/vhost_user.c b/vhost_user.c
index f8324c59cc6c..aea1e2cbcea5 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -32,8 +32,6 @@
 #include <inttypes.h>
 #include <time.h>
 #include <net/ethernet.h>
-#include <netinet/in.h>
-#include <sys/epoll.h>
 #include <sys/eventfd.h>
 #include <sys/mman.h>
 #include <linux/vhost_types.h>
@@ -45,6 +43,7 @@
 #include "vhost_user.h"
 #include "pcap.h"
 #include "migrate.h"
+#include "epoll_ctl.h"
 
 /* vhost-user version we are compatible with */
 #define VHOST_USER_VERSION 1
@@ -753,11 +752,8 @@ static void vu_set_watch(const struct vu_dev *vdev, int idx)
 		.fd = vdev->vq[idx].kick_fd,
 		.queue = idx
 	 };
-	struct epoll_event ev = { 0 };
 
-	ev.data.u64 = ref.u64;
-	ev.events = EPOLLIN;
-	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+	epoll_add(vdev->context->epollfd, EPOLLIN, &ref);
 }
 
 /**
diff --git a/vu_common.c b/vu_common.c
index b716070ea3c3..b13b7c308fd8 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -6,7 +6,6 @@
  */
 
 #include <errno.h>
-#include <unistd.h>
 #include <sys/uio.h>
 #include <sys/eventfd.h>
 #include <netinet/if_ether.h>
@@ -19,6 +18,7 @@
 #include "pcap.h"
 #include "vu_common.h"
 #include "migrate.h"
+#include "epoll_ctl.h"
 
 #define VU_MAX_TX_BUFFER_NB	2
 
-- 
2.51.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 3/7] util: Move epoll registration out of sock_l4_sa()
  2025-10-17 10:31 [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading Laurent Vivier
  2025-10-17 10:31 ` [PATCH v4 1/7] util: Simplify epoll_del() interface to take epollfd directly Laurent Vivier
  2025-10-17 10:31 ` [PATCH v4 2/7] epoll_ctl: Extract epoll operations Laurent Vivier
@ 2025-10-17 10:31 ` Laurent Vivier
  2025-10-17 10:31 ` [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common Laurent Vivier
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 10:31 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

Move epoll_add() calls from sock_l4_sa() to the protocol-specific code
(icmp.c, pif.c, udp_flow.c) to give callers more control over epoll
registration. This allows sock_l4_sa() to focus solely on socket
creation and binding, while epoll management happens at a higher level.

Remove the data parameter from sock_l4_sa() and flowside_sock_l4() as
it's no longer needed - callers now construct the full epoll_ref and
register the socket themselves after creation.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c     | 10 ++++------
 flow.h     |  2 +-
 icmp.c     | 15 +++++++++++----
 pif.c      | 32 +++++++++++++++++++++++++-------
 udp_flow.c | 16 ++++++++++++++--
 util.c     | 10 +---------
 util.h     |  2 +-
 7 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/flow.c b/flow.c
index feefda3ce74e..b14e9d8b63ff 100644
--- a/flow.c
+++ b/flow.c
@@ -163,7 +163,6 @@ static void flowside_from_af(struct flowside *side, sa_family_t af,
  * @type:	Socket epoll type
  * @sa:		Socket address
  * @sl:		Length of @sa
- * @data:	epoll reference data
  */
 struct flowside_sock_args {
 	const struct ctx *c;
@@ -173,7 +172,6 @@ struct flowside_sock_args {
 	const struct sockaddr *sa;
 	socklen_t sl;
 	const char *path;
-	uint32_t data;
 };
 
 /** flowside_sock_splice() - Create and bind socket for PIF_SPLICE based on flowside
@@ -188,7 +186,7 @@ static int flowside_sock_splice(void *arg)
 	ns_enter(a->c);
 
 	a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,
-	                   a->sa->sa_family == AF_INET6, a->data);
+	                   a->sa->sa_family == AF_INET6);
 	a->err = errno;
 
 	return 0;
@@ -205,7 +203,7 @@ static int flowside_sock_splice(void *arg)
  *         (if specified).
  */
 int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
-		     const struct flowside *tgt, uint32_t data)
+		     const struct flowside *tgt)
 {
 	const char *ifname = NULL;
 	union sockaddr_inany sa;
@@ -225,12 +223,12 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 			ifname = c->ip6.ifname_out;
 
 		return sock_l4_sa(c, type, &sa, sl, ifname,
-				  sa.sa_family == AF_INET6, data);
+				  sa.sa_family == AF_INET6);
 
 	case PIF_SPLICE: {
 		struct flowside_sock_args args = {
 			.c = c, .type = type,
-			.sa = &sa.sa, .sl = sl, .data = data,
+			.sa = &sa.sa, .sl = sl,
 		};
 		NS_CALL(flowside_sock_splice, &args);
 		errno = args.err;
diff --git a/flow.h b/flow.h
index cac618ad0ca1..ef138b83add8 100644
--- a/flow.h
+++ b/flow.h
@@ -167,7 +167,7 @@ static inline bool flowside_eq(const struct flowside *left,
 }
 
 int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
-		     const struct flowside *tgt, uint32_t data);
+		     const struct flowside *tgt);
 int flowside_connect(const struct ctx *c, int s,
 		     uint8_t pif, const struct flowside *tgt);
 
diff --git a/icmp.c b/icmp.c
index c26561da80bf..56dfac6c958e 100644
--- a/icmp.c
+++ b/icmp.c
@@ -170,10 +170,10 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 {
 	uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6;
 	uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6;
-	union epoll_ref ref = { .type = EPOLL_TYPE_PING };
 	union flow *flow = flow_alloc();
 	struct icmp_ping_flow *pingf;
 	const struct flowside *tgt;
+	union epoll_ref ref;
 
 	if (!flow)
 		return NULL;
@@ -194,9 +194,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 
 	pingf->seq = -1;
 
-	ref.flowside = FLOW_SIDX(flow, TGTSIDE);
-	pingf->sock = flowside_sock_l4(c, EPOLL_TYPE_PING, PIF_HOST,
-				       tgt, ref.data);
+	pingf->sock = flowside_sock_l4(c, EPOLL_TYPE_PING, PIF_HOST, tgt);
 
 	if (pingf->sock < 0) {
 		warn("Cannot open \"ping\" socket. You might need to:");
@@ -208,6 +206,15 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (pingf->sock > FD_REF_MAX)
 		goto cancel;
 
+	ref.type = EPOLL_TYPE_PING;
+	ref.flowside = FLOW_SIDX(flow, TGTSIDE);
+	ref.fd = pingf->sock;
+
+	if (epoll_add(c->epollfd, EPOLLIN, &ref) < 0) {
+		close(pingf->sock);
+		goto cancel;
+	}
+
 	flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id);
 
 	flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE));
diff --git a/pif.c b/pif.c
index 592fafaab58a..17b41705183c 100644
--- a/pif.c
+++ b/pif.c
@@ -5,6 +5,7 @@
  * Passt/pasta interface types and IDs
  */
 
+#include <errno.h>
 #include <stdint.h>
 #include <assert.h>
 #include <netinet/in.h>
@@ -14,7 +15,7 @@
 #include "siphash.h"
 #include "ip.h"
 #include "inany.h"
-#include "passt.h"
+#include "epoll_ctl.h"
 
 const char *pif_type_str[] = {
 	[PIF_NONE]		= "<none>",
@@ -83,7 +84,9 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		.sa6.sin6_addr = in6addr_any,
 		.sa6.sin6_port = htons(port),
 	};
+	union epoll_ref ref;
 	socklen_t sl;
+	int ret;
 
 	ASSERT(pif_is_socket(pif));
 
@@ -93,11 +96,26 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		ASSERT(addr && inany_is_loopback(addr));
 	}
 
-	if (!addr)
-		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
-				  ifname, false, data);
+	if (!addr) {
+		ref.fd = sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
+				    ifname, false);
+	} else {
+		pif_sockaddr(c, &sa, &sl, pif, addr, port);
+		ref.fd = sock_l4_sa(c, type, &sa, sl,
+				    ifname, sa.sa_family == AF_INET6);
+	}
+
+	if (ref.fd < 0)
+		return ref.fd;
+
+	ref.type = type;
+	ref.data = data;
+
+	ret = epoll_add(c->epollfd, EPOLLIN, &ref);
+	if (ret < 0) {
+		close(ref.fd);
+		return ret;
+	}
 
-	pif_sockaddr(c, &sa, &sl, pif, addr, port);
-	return sock_l4_sa(c, type, &sa, sl,
-			  ifname, sa.sa_family == AF_INET6, data);
+	return ref.fd;
 }
diff --git a/udp_flow.c b/udp_flow.c
index d9c75f1bb1d8..caaf3dababb1 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -78,16 +78,28 @@ static int udp_flow_sock(const struct ctx *c,
 		flow_sidx_t sidx;
 		uint32_t data;
 	} fref = { .sidx = FLOW_SIDX(uflow, sidei) };
+	union epoll_ref ref;
+	int rc;
 	int s;
 
-	s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
+	s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side);
 	if (s < 0) {
 		flow_dbg_perror(uflow, "Couldn't open flow specific socket");
 		return s;
 	}
 
+	ref.type = EPOLL_TYPE_UDP;
+	ref.data = fref.data;
+	ref.fd = s;
+
+	rc = epoll_add(c->epollfd, EPOLLIN, &ref);
+	if (rc < 0) {
+		close(s);
+		return rc;
+	}
+
 	if (flowside_connect(c, s, pif, side) < 0) {
-		int rc = -errno;
+		rc = -errno;
 
 		epoll_del(c->epollfd, s);
 		close(s);
diff --git a/util.c b/util.c
index b2490123590a..707ab3388d13 100644
--- a/util.c
+++ b/util.c
@@ -47,16 +47,14 @@
  * @sl:		Length of @sa
  * @ifname:	Interface for binding, NULL for any
  * @v6only:	Set IPV6_V6ONLY socket option
- * @data:	epoll reference portion for protocol handlers
  *
  * Return: newly created socket, negative error code on failure
  */
 int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	       const void *sa, socklen_t sl,
-	       const char *ifname, bool v6only, uint32_t data)
+	       const char *ifname, bool v6only)
 {
 	sa_family_t af = ((const struct sockaddr *)sa)->sa_family;
-	union epoll_ref ref = { .type = type, .data = data };
 	bool freebind = false;
 	int fd, y = 1, ret;
 	uint8_t proto;
@@ -99,8 +97,6 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		return -EBADF;
 	}
 
-	ref.fd = fd;
-
 	if (v6only)
 		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
 			debug("Failed to set IPV6_V6ONLY on socket %i", fd);
@@ -171,10 +167,6 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		return ret;
 	}
 
-	ret = epoll_add(c->epollfd, EPOLLIN, &ref);
-	if (ret < 0)
-		return ret;
-
 	return fd;
 }
 
diff --git a/util.h b/util.h
index 8e4b4c5c6032..9108f8f19e14 100644
--- a/util.h
+++ b/util.h
@@ -207,7 +207,7 @@ struct ctx;
 
 int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	       const void *sa, socklen_t sl,
-	       const char *ifname, bool v6only, uint32_t data);
+	       const char *ifname, bool v6only);
 int sock_unix(char *sock_path);
 void sock_probe_mem(struct ctx *c);
 long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
-- 
2.51.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common
  2025-10-17 10:31 [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading Laurent Vivier
                   ` (2 preceding siblings ...)
  2025-10-17 10:31 ` [PATCH v4 3/7] util: Move epoll registration out of sock_l4_sa() Laurent Vivier
@ 2025-10-17 10:31 ` Laurent Vivier
  2025-10-17 17:43   ` Stefano Brivio
  2025-10-20  1:34   ` David Gibson
  2025-10-17 10:31 ` [PATCH v4 5/7] icmp: Use thread-based epoll management for ICMP flows Laurent Vivier
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 10:31 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
whether a connection was registered with epoll, not which epoll instance.
This limited flexibility for future multi-epoll support.

Replace the boolean with a threadnb field in flow_common that identifies
which thread (and thus which epoll instance) the flow is registered with.
Use FLOW_THREADNB_INVALID to indicate when a flow is not registered with
any epoll instance. A threadnb_to_epollfd[] mapping table translates
thread numbers to their corresponding epoll file descriptors.

Add helper functions:
- flow_in_epoll() to check if a flow is registered with epoll
- flow_epollfd() to retrieve the epoll fd for a flow's thread
- flow_thread_register() to register an epoll fd with a thread
- flow_thread_set() to set the thread number of a flow

This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
the need to pass the context 'c', since the epoll fd is now directly
accessible from the flow structure via flow_epollfd().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 flow.c       | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 flow.h       | 12 ++++++++++++
 passt.c      |  1 +
 tcp.c        | 39 ++++++++++++++++++++------------------
 tcp_conn.h   |  8 +-------
 tcp_splice.c | 24 ++++++++++++------------
 6 files changed, 99 insertions(+), 38 deletions(-)

diff --git a/flow.c b/flow.c
index b14e9d8b63ff..d56bae776239 100644
--- a/flow.c
+++ b/flow.c
@@ -116,6 +116,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
 unsigned flow_first_free;
 union flow flowtab[FLOW_MAX];
 static const union flow *flow_new_entry; /* = NULL */
+static int threadnb_to_epollfd[FLOW_THREADNB_SIZE];
 
 /* Hash table to index it */
 #define FLOW_HASH_LOAD		70		/* % */
@@ -347,6 +348,55 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 	flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
 }
 
+/**
+ * flow_in_epoll() - Check if flow is registered with an epoll instance
+ * @f:		Flow to check
+ *
+ * Return: true if flow is registered with epoll, false otherwise
+ */
+bool flow_in_epoll(const struct flow_common *f)
+{
+	return f->threadnb != FLOW_THREADNB_INVALID;
+}
+
+/**
+ * flow_epollfd() - Get the epoll file descriptor for a flow
+ * @f:		Flow to query
+ *
+ * Return: epoll file descriptor associated with the flow's thread
+ */
+int flow_epollfd(const struct flow_common *f)
+{
+	ASSERT(f->threadnb < FLOW_THREADNB_MAX);
+
+	return threadnb_to_epollfd[f->threadnb];
+}
+
+/**
+ * flow_thread_set() - Associate a flow with a thread
+ * @f:		Flow to update
+ * @threadnb:	Thread number to associate with this flow
+ */
+void flow_thread_set(struct flow_common *f, int threadnb)
+{
+	ASSERT(threadnb < FLOW_THREADNB_MAX);
+
+	f->threadnb = threadnb;
+}
+
+/**
+ * flow_thread_register() - Initialize the threadnb -> epollfd mapping
+ * @threadnb:	Thread number to associate to
+ * @epollfd:	epoll file descriptor for the thread
+ */
+void flow_thread_register(int threadnb, int epollfd)
+{
+	ASSERT(threadnb < FLOW_THREADNB_MAX);
+	ASSERT(epollfd >= 0);
+
+	threadnb_to_epollfd[threadnb] = epollfd;
+}
+
 /**
  * flow_initiate_() - Move flow to INI, setting pif[INISIDE]
  * @flow:	Flow to change state
@@ -548,6 +598,7 @@ union flow *flow_alloc(void)
 
 	flow_new_entry = flow;
 	memset(flow, 0, sizeof(*flow));
+	flow->f.threadnb = FLOW_THREADNB_INVALID;
 	flow_set_state(&flow->f, FLOW_STATE_NEW);
 
 	return flow;
@@ -827,7 +878,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 		case FLOW_TCP_SPLICE:
 			closed = tcp_splice_flow_defer(&flow->tcp_splice);
 			if (!closed && timer)
-				tcp_splice_timer(c, &flow->tcp_splice);
+				tcp_splice_timer(&flow->tcp_splice);
 			break;
 		case FLOW_PING4:
 		case FLOW_PING6:
diff --git a/flow.h b/flow.h
index ef138b83add8..700d8b32c990 100644
--- a/flow.h
+++ b/flow.h
@@ -177,6 +177,8 @@ int flowside_connect(const struct ctx *c, int s,
  * @type:	Type of packet flow
  * @pif[]:	Interface for each side of the flow
  * @side[]:	Information for each side of the flow
+ * @threadnb:	Thread number flow is registered with
+ *		(FLOW_THREADNB_INVALID if not)
  */
 struct flow_common {
 #ifdef __GNUC__
@@ -192,8 +194,14 @@ struct flow_common {
 #endif
 	uint8_t		pif[SIDES];
 	struct flowside	side[SIDES];
+#define FLOW_THREADNB_BITS 8
+	unsigned int	threadnb:FLOW_THREADNB_BITS;
 };
 
+#define FLOW_THREADNB_SIZE	(1 << FLOW_THREADNB_BITS)
+#define FLOW_THREADNB_MAX	(FLOW_THREADNB_SIZE - 1)
+#define FLOW_THREADNB_INVALID	FLOW_THREADNB_MAX
+
 #define FLOW_INDEX_BITS		17	/* 128k - 1 */
 #define FLOW_MAX		MAX_FROM_BITS(FLOW_INDEX_BITS)
 
@@ -249,6 +257,10 @@ flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
 union flow;
 
 void flow_init(void);
+bool flow_in_epoll(const struct flow_common *f);
+int flow_epollfd(const struct flow_common *f);
+void flow_thread_set(struct flow_common *f, int threadnb);
+void flow_thread_register(int threadnb, int epollfd);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
 int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
 			      int fd);
diff --git a/passt.c b/passt.c
index af928111786b..37f2c897be84 100644
--- a/passt.c
+++ b/passt.c
@@ -285,6 +285,7 @@ int main(int argc, char **argv)
 	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
 	if (c.epollfd == -1)
 		die_perror("Failed to create epoll file descriptor");
+	flow_thread_register(0, c.epollfd);
 
 	if (getrlimit(RLIMIT_NOFILE, &limit))
 		die_perror("Failed to get maximum value of open files limit");
diff --git a/tcp.c b/tcp.c
index db9f17c0622f..8c49852b8454 100644
--- a/tcp.c
+++ b/tcp.c
@@ -504,25 +504,27 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
  */
 static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
-	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = flow_in_epoll(&conn->f) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
 		                .flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), };
 	struct epoll_event ev = { .data.u64 = ref.u64 };
+	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
+						   : c->epollfd;
 
 	if (conn->events == CLOSED) {
-		if (conn->in_epoll)
-			epoll_del(c->epollfd, conn->sock);
+		if (flow_in_epoll(&conn->f))
+			epoll_del(epollfd, conn->sock);
 		if (conn->timer != -1)
-			epoll_del(c->epollfd, conn->timer);
+			epoll_del(epollfd, conn->timer);
 		return 0;
 	}
 
 	ev.events = tcp_conn_epoll_events(conn->events, conn->flags);
 
-	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
+	if (epoll_ctl(epollfd, m, conn->sock, &ev))
 		return -errno;
 
-	conn->in_epoll = true;
+	flow_thread_set(&conn->f, 0);
 
 	if (conn->timer != -1) {
 		union epoll_ref ref_t = { .type = EPOLL_TYPE_TCP_TIMER,
@@ -531,7 +533,8 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		struct epoll_event ev_t = { .data.u64 = ref_t.u64,
 					    .events = EPOLLIN | EPOLLET };
 
-		if (epoll_ctl(c->epollfd, EPOLL_CTL_MOD, conn->timer, &ev_t))
+		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_MOD,
+			      conn->timer, &ev_t))
 			return -errno;
 	}
 
@@ -540,12 +543,11 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 
 /**
  * tcp_timer_ctl() - Set timerfd based on flags/events, create timerfd if needed
- * @c:		Execution context
  * @conn:	Connection pointer
  *
  * #syscalls timerfd_create timerfd_settime
  */
-static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
+static void tcp_timer_ctl(struct tcp_tap_conn *conn)
 {
 	struct itimerspec it = { { 0 }, { 0 } };
 
@@ -570,7 +572,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		}
 		conn->timer = fd;
 
-		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
+		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_ADD,
+			      conn->timer, &ev)) {
 			flow_dbg_perror(conn, "failed to add timer");
 			close(conn->timer);
 			conn->timer = -1;
@@ -628,7 +631,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
 			 * flags and factor this into the logic below.
 			 */
 			if (flag == ACK_FROM_TAP_DUE)
-				tcp_timer_ctl(c, conn);
+				tcp_timer_ctl(conn);
 
 			return;
 		}
@@ -644,7 +647,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
 	if (flag == ACK_FROM_TAP_DUE || flag == ACK_TO_TAP_DUE		  ||
 	    (flag == ~ACK_FROM_TAP_DUE && (conn->flags & ACK_TO_TAP_DUE)) ||
 	    (flag == ~ACK_TO_TAP_DUE   && (conn->flags & ACK_FROM_TAP_DUE)))
-		tcp_timer_ctl(c, conn);
+		tcp_timer_ctl(conn);
 }
 
 /**
@@ -699,7 +702,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp_epoll_ctl(c, conn);
 
 	if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
-		tcp_timer_ctl(c, conn);
+		tcp_timer_ctl(conn);
 }
 
 /**
@@ -1757,7 +1760,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 				   seq, conn->seq_from_tap);
 
 			tcp_send_flag(c, conn, ACK);
-			tcp_timer_ctl(c, conn);
+			tcp_timer_ctl(conn);
 
 			if (p->count == 1) {
 				tcp_tap_window_update(c, conn,
@@ -2406,7 +2409,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
 
 	if (conn->flags & ACK_TO_TAP_DUE) {
 		tcp_send_flag(c, conn, ACK_IF_NEEDED);
-		tcp_timer_ctl(c, conn);
+		tcp_timer_ctl(conn);
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		if (!(conn->events & ESTABLISHED)) {
 			flow_dbg(conn, "handshake timeout");
@@ -2428,7 +2431,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
 				return;
 
 			tcp_data_from_sock(c, conn);
-			tcp_timer_ctl(c, conn);
+			tcp_timer_ctl(conn);
 		}
 	} else {
 		struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
@@ -3476,7 +3479,7 @@ int tcp_flow_migrate_source_ext(const struct ctx *c,
 	if (c->migrate_no_linger)
 		close(s);
 	else
-		epoll_del(c->epollfd, s);
+		epoll_del(flow_epollfd(&conn->f), s);
 
 	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
 	 * based on the end of the queues.
@@ -3625,7 +3628,7 @@ static int tcp_flow_repair_connect(const struct ctx *c,
 		return rc;
 	}
 
-	conn->in_epoll = 0;
+	conn->f.threadnb = FLOW_THREADNB_INVALID;
 	conn->timer = -1;
 	conn->listening_sock = -1;
 
diff --git a/tcp_conn.h b/tcp_conn.h
index 38b5c541f003..81333122d531 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -12,7 +12,6 @@
 /**
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
  * @f:			Generic flow information
- * @in_epoll:		Is the connection in the epoll set?
  * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
  * @ws_from_tap:	Window scaling factor advertised from tap/guest
  * @ws_to_tap:		Window scaling factor advertised to tap/guest
@@ -36,8 +35,6 @@ struct tcp_tap_conn {
 	/* Must be first element */
 	struct flow_common f;
 
-	bool		in_epoll	:1;
-
 #define TCP_RETRANS_BITS		3
 	unsigned int	retrans		:TCP_RETRANS_BITS;
 #define TCP_MAX_RETRANS			MAX_FROM_BITS(TCP_RETRANS_BITS)
@@ -196,7 +193,6 @@ struct tcp_tap_transfer_ext {
  * @written:		Bytes written (not fully written from one other side read)
  * @events:		Events observed/actions performed on connection
  * @flags:		Connection flags (attributes, not events)
- * @in_epoll:		Is the connection in the epoll set?
  */
 struct tcp_splice_conn {
 	/* Must be first element */
@@ -220,8 +216,6 @@ struct tcp_splice_conn {
 #define RCVLOWAT_SET(sidei_)		((sidei_) ? BIT(1) : BIT(0))
 #define RCVLOWAT_ACT(sidei_)		((sidei_) ? BIT(3) : BIT(2))
 #define CLOSING				BIT(4)
-
-	bool in_epoll	:1;
 };
 
 /* Socket pools */
@@ -245,7 +239,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
 bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
 
 bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
-void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
+void tcp_splice_timer(struct tcp_splice_conn *conn);
 int tcp_conn_pool_sock(int pool[]);
 int tcp_conn_sock(sa_family_t af);
 int tcp_sock_refill_pool(int pool[], sa_family_t af);
diff --git a/tcp_splice.c b/tcp_splice.c
index 6f21184bdc55..703bd7610890 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -149,7 +149,9 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
 static int tcp_splice_epoll_ctl(const struct ctx *c,
 				struct tcp_splice_conn *conn)
 {
-	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
+						   : c->epollfd;
+	int m = flow_in_epoll(&conn->f) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	const union epoll_ref ref[SIDES] = {
 		{ .type = EPOLL_TYPE_TCP_SPLICE, .fd = conn->s[0],
 		  .flowside = FLOW_SIDX(conn, 0) },
@@ -161,25 +163,24 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 
 	tcp_splice_conn_epoll_events(conn->events, ev);
 
-	if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) ||
-	    epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) {
+
+	if (epoll_ctl(epollfd, m, conn->s[0], &ev[0]) ||
+	    epoll_ctl(epollfd, m, conn->s[1], &ev[1])) {
 		int ret = -errno;
 		flow_perror(conn, "ERROR on epoll_ctl()");
 		return ret;
 	}
-
-	conn->in_epoll = true;
+	flow_thread_set(&conn->f, 0);
 
 	return 0;
 }
 
 /**
  * conn_flag_do() - Set/unset given flag, log, update epoll on CLOSING flag
- * @c:		Execution context
  * @conn:	Connection pointer
  * @flag:	Flag to set, or ~flag to unset
  */
-static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
+static void conn_flag_do(struct tcp_splice_conn *conn,
 			 unsigned long flag)
 {
 	if (flag & (flag - 1)) {
@@ -204,15 +205,15 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 	}
 
 	if (flag == CLOSING) {
-		epoll_del(c->epollfd, conn->s[0]);
-		epoll_del(c->epollfd, conn->s[1]);
+		epoll_del(flow_epollfd(&conn->f), conn->s[0]);
+		epoll_del(flow_epollfd(&conn->f), conn->s[1]);
 	}
 }
 
 #define conn_flag(c, conn, flag)					\
 	do {								\
 		flow_trace(conn, "flag at %s:%i", __func__, __LINE__);	\
-		conn_flag_do(c, conn, flag);				\
+		conn_flag_do(conn, flag);				\
 	} while (0)
 
 /**
@@ -751,10 +752,9 @@ void tcp_splice_init(struct ctx *c)
 
 /**
  * tcp_splice_timer() - Timer for spliced connections
- * @c:		Execution context
  * @conn:	Connection to handle
  */
-void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn)
+void tcp_splice_timer(struct tcp_splice_conn *conn)
 {
 	unsigned sidei;
 
-- 
2.51.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 5/7] icmp: Use thread-based epoll management for ICMP flows
  2025-10-17 10:31 [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading Laurent Vivier
                   ` (3 preceding siblings ...)
  2025-10-17 10:31 ` [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common Laurent Vivier
@ 2025-10-17 10:31 ` Laurent Vivier
  2025-10-20  1:35   ` David Gibson
  2025-10-17 10:31 ` [PATCH v4 6/7] udp: Use thread-based epoll management for UDP flows Laurent Vivier
  2025-10-17 10:31 ` [PATCH v4 7/7] passt: Move main event loop processing into passt_worker() Laurent Vivier
  6 siblings, 1 reply; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 10:31 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Store the thread number in the flow_common structure for ICMP ping
flows using flow_epollfd_set() and retrieve the corresponding epoll
file descriptor with flow_epollfd_get() instead of passing c->epollfd
directly. This makes ICMP consistent with the recent TCP changes and
follows the pattern established in previous commit.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 icmp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/icmp.c b/icmp.c
index 56dfac6c958e..baddd8e5aacb 100644
--- a/icmp.c
+++ b/icmp.c
@@ -149,7 +149,7 @@ unexpected:
 static void icmp_ping_close(const struct ctx *c,
 			    const struct icmp_ping_flow *pingf)
 {
-	epoll_del(c->epollfd, pingf->sock);
+	epoll_del(flow_epollfd(&pingf->f), pingf->sock);
 	close(pingf->sock);
 	flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
 }
@@ -206,11 +206,13 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (pingf->sock > FD_REF_MAX)
 		goto cancel;
 
+	flow_thread_set(&pingf->f, 0);
+
 	ref.type = EPOLL_TYPE_PING;
 	ref.flowside = FLOW_SIDX(flow, TGTSIDE);
 	ref.fd = pingf->sock;
 
-	if (epoll_add(c->epollfd, EPOLLIN, &ref) < 0) {
+	if (epoll_add(flow_epollfd(&pingf->f), EPOLLIN, &ref) < 0) {
 		close(pingf->sock);
 		goto cancel;
 	}
-- 
2.51.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 6/7] udp: Use thread-based epoll management for UDP flows
  2025-10-17 10:31 [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading Laurent Vivier
                   ` (4 preceding siblings ...)
  2025-10-17 10:31 ` [PATCH v4 5/7] icmp: Use thread-based epoll management for ICMP flows Laurent Vivier
@ 2025-10-17 10:31 ` Laurent Vivier
  2025-10-20  1:39   ` David Gibson
  2025-10-17 10:31 ` [PATCH v4 7/7] passt: Move main event loop processing into passt_worker() Laurent Vivier
  6 siblings, 1 reply; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 10:31 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Store the thread number in the flow_common structure for UDP
flows using flow_epollfd_set() and retrieve the corresponding epoll
file descriptor with flow_epollfd_get() instead of passing c->epollfd
directly. This makes UDP consistent with the recent TCP and ICMP
changes.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 udp_flow.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/udp_flow.c b/udp_flow.c
index caaf3dababb1..ea89a14775a4 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -52,7 +52,7 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 	flow_foreach_sidei(sidei) {
 		flow_hash_remove(c, FLOW_SIDX(uflow, sidei));
 		if (uflow->s[sidei] >= 0) {
-			epoll_del(c->epollfd, uflow->s[sidei]);
+			epoll_del(flow_epollfd(&uflow->f), uflow->s[sidei]);
 			close(uflow->s[sidei]);
 			uflow->s[sidei] = -1;
 		}
@@ -92,7 +92,9 @@ static int udp_flow_sock(const struct ctx *c,
 	ref.data = fref.data;
 	ref.fd = s;
 
-	rc = epoll_add(c->epollfd, EPOLLIN, &ref);
+	flow_thread_set(&uflow->f, 0);
+
+	rc = epoll_add(flow_epollfd(&uflow->f), EPOLLIN, &ref);
 	if (rc < 0) {
 		close(s);
 		return rc;
@@ -101,7 +103,7 @@ static int udp_flow_sock(const struct ctx *c,
 	if (flowside_connect(c, s, pif, side) < 0) {
 		rc = -errno;
 
-		epoll_del(c->epollfd, s);
+		epoll_del(flow_epollfd(&uflow->f), s);
 		close(s);
 
 		flow_dbg_perror(uflow, "Couldn't connect flow socket");
-- 
2.51.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4 7/7] passt: Move main event loop processing into passt_worker()
  2025-10-17 10:31 [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading Laurent Vivier
                   ` (5 preceding siblings ...)
  2025-10-17 10:31 ` [PATCH v4 6/7] udp: Use thread-based epoll management for UDP flows Laurent Vivier
@ 2025-10-17 10:31 ` Laurent Vivier
  2025-10-17 17:43   ` Stefano Brivio
  2025-10-20  1:43   ` David Gibson
  6 siblings, 2 replies; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 10:31 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Extract the epoll event processing logic from main() into a separate
passt_worker() function. This refactoring prepares the code for future
threading support where passt_worker() will be called as a worker thread
callback.

The new function handles:
- Processing epoll events and dispatching to protocol handlers
- Event statistics tracking and printing
- Post-handler periodic tasks (timers, deferred work)
- Migration handling

No functional changes, purely a code restructuring.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 passt.c | 160 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 88 insertions(+), 72 deletions(-)

diff --git a/passt.c b/passt.c
index 37f2c897be84..5bfa4c6353d9 100644
--- a/passt.c
+++ b/passt.c
@@ -229,6 +229,92 @@ static void print_stats(const struct ctx *c, const struct passt_stats *stats,
 	lines_printed++;
 }
 
+/**
+ * passt_worker() - Process epoll events and handle protocol operations
+ * @opaque:	Pointer to execution context (struct ctx)
+ * @nfds:	Number of file descriptors ready (epoll_wait return value)
+ * @events:	epoll_event array of ready file descriptors
+ */
+static void passt_worker(void *opaque, int nfds,  struct epoll_event *events)
+{
+	static struct passt_stats stats = { 0 };
+	struct ctx *c = opaque;
+	struct timespec now;
+	int i;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &now))
+		err_perror("Failed to get CLOCK_MONOTONIC time");
+
+	for (i = 0; i < nfds; i++) {
+		union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
+		uint32_t eventmask = events[i].events;
+
+		trace("%s: epoll event on %s %i (events: 0x%08x)",
+		      c->mode == MODE_PASTA ? "pasta" : "passt",
+		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
+
+		switch (ref.type) {
+		case EPOLL_TYPE_TAP_PASTA:
+			tap_handler_pasta(c, eventmask, &now);
+			break;
+		case EPOLL_TYPE_TAP_PASST:
+			tap_handler_passt(c, eventmask, &now);
+			break;
+		case EPOLL_TYPE_TAP_LISTEN:
+			tap_listen_handler(c, eventmask);
+			break;
+		case EPOLL_TYPE_NSQUIT_INOTIFY:
+			pasta_netns_quit_inotify_handler(c, ref.fd);
+			break;
+		case EPOLL_TYPE_NSQUIT_TIMER:
+			pasta_netns_quit_timer_handler(c, ref);
+			break;
+		case EPOLL_TYPE_TCP:
+			tcp_sock_handler(c, ref, eventmask);
+			break;
+		case EPOLL_TYPE_TCP_SPLICE:
+			tcp_splice_sock_handler(c, ref, eventmask);
+			break;
+		case EPOLL_TYPE_TCP_LISTEN:
+			tcp_listen_handler(c, ref, &now);
+			break;
+		case EPOLL_TYPE_TCP_TIMER:
+			tcp_timer_handler(c, ref);
+			break;
+		case EPOLL_TYPE_UDP_LISTEN:
+			udp_listen_sock_handler(c, ref, eventmask, &now);
+			break;
+		case EPOLL_TYPE_UDP:
+			udp_sock_handler(c, ref, eventmask, &now);
+			break;
+		case EPOLL_TYPE_PING:
+			icmp_sock_handler(c, ref);
+			break;
+		case EPOLL_TYPE_VHOST_CMD:
+			vu_control_handler(c->vdev, c->fd_tap, eventmask);
+			break;
+		case EPOLL_TYPE_VHOST_KICK:
+			vu_kick_cb(c->vdev, ref, &now);
+			break;
+		case EPOLL_TYPE_REPAIR_LISTEN:
+			repair_listen_handler(c, eventmask);
+			break;
+		case EPOLL_TYPE_REPAIR:
+			repair_handler(c, eventmask);
+			break;
+		default:
+			/* Can't happen */
+			ASSERT(0);
+		}
+		stats.events[ref.type]++;
+		print_stats(c, &stats, &now);
+	}
+
+	post_handler(c, &now);
+
+	migrate_handler(c);
+}
+
 /**
  * main() - Entry point and main loop
  * @argc:	Argument count
@@ -246,8 +332,7 @@ static void print_stats(const struct ctx *c, const struct passt_stats *stats,
 int main(int argc, char **argv)
 {
 	struct epoll_event events[NUM_EPOLL_EVENTS];
-	struct passt_stats stats = { 0 };
-	int nfds, i, devnull_fd = -1;
+	int nfds, devnull_fd = -1;
 	struct ctx c = { 0 };
 	struct rlimit limit;
 	struct timespec now;
@@ -355,77 +440,8 @@ loop:
 	if (nfds == -1 && errno != EINTR)
 		die_perror("epoll_wait() failed in main loop");
 
-	if (clock_gettime(CLOCK_MONOTONIC, &now))
-		err_perror("Failed to get CLOCK_MONOTONIC time");
-
-	for (i = 0; i < nfds; i++) {
-		union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
-		uint32_t eventmask = events[i].events;
-
-		trace("%s: epoll event on %s %i (events: 0x%08x)",
-		      c.mode == MODE_PASTA ? "pasta" : "passt",
-		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
-
-		switch (ref.type) {
-		case EPOLL_TYPE_TAP_PASTA:
-			tap_handler_pasta(&c, eventmask, &now);
-			break;
-		case EPOLL_TYPE_TAP_PASST:
-			tap_handler_passt(&c, eventmask, &now);
-			break;
-		case EPOLL_TYPE_TAP_LISTEN:
-			tap_listen_handler(&c, eventmask);
-			break;
-		case EPOLL_TYPE_NSQUIT_INOTIFY:
-			pasta_netns_quit_inotify_handler(&c, ref.fd);
-			break;
-		case EPOLL_TYPE_NSQUIT_TIMER:
-			pasta_netns_quit_timer_handler(&c, ref);
-			break;
-		case EPOLL_TYPE_TCP:
-			tcp_sock_handler(&c, ref, eventmask);
-			break;
-		case EPOLL_TYPE_TCP_SPLICE:
-			tcp_splice_sock_handler(&c, ref, eventmask);
-			break;
-		case EPOLL_TYPE_TCP_LISTEN:
-			tcp_listen_handler(&c, ref, &now);
-			break;
-		case EPOLL_TYPE_TCP_TIMER:
-			tcp_timer_handler(&c, ref);
-			break;
-		case EPOLL_TYPE_UDP_LISTEN:
-			udp_listen_sock_handler(&c, ref, eventmask, &now);
-			break;
-		case EPOLL_TYPE_UDP:
-			udp_sock_handler(&c, ref, eventmask, &now);
-			break;
-		case EPOLL_TYPE_PING:
-			icmp_sock_handler(&c, ref);
-			break;
-		case EPOLL_TYPE_VHOST_CMD:
-			vu_control_handler(c.vdev, c.fd_tap, eventmask);
-			break;
-		case EPOLL_TYPE_VHOST_KICK:
-			vu_kick_cb(c.vdev, ref, &now);
-			break;
-		case EPOLL_TYPE_REPAIR_LISTEN:
-			repair_listen_handler(&c, eventmask);
-			break;
-		case EPOLL_TYPE_REPAIR:
-			repair_handler(&c, eventmask);
-			break;
-		default:
-			/* Can't happen */
-			ASSERT(0);
-		}
-		stats.events[ref.type]++;
-		print_stats(&c, &stats, &now);
-	}
-
-	post_handler(&c, &now);
 
-	migrate_handler(&c);
+	passt_worker(&c, nfds, events);
 
 	goto loop;
 }
-- 
2.51.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/7] epoll_ctl: Extract epoll operations
  2025-10-17 10:31 ` [PATCH v4 2/7] epoll_ctl: Extract epoll operations Laurent Vivier
@ 2025-10-17 11:48   ` Stefano Brivio
  2025-10-17 12:21     ` Laurent Vivier
  2025-10-20  1:20   ` David Gibson
  1 sibling, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2025-10-17 11:48 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

I was reviewing v3 but I can seamlessly move on to v4, just two things
here:

On Fri, 17 Oct 2025 12:31:24 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> Centralize epoll_add() and epoll_del() helper functions into new
> epoll_ctl.c/h files.
> 
> This also moves the union epoll_ref definition from passt.h to
> epoll_ctl.h where it's more logically placed.
> 
> The new epoll_add() helper simplifies adding file descriptors to epoll
> by taking an epoll_ref and events, handling error reporting
> consistently across all call sites.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile     | 22 +++++++++++-----------
>  epoll_ctl.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  epoll_ctl.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  icmp.c       |  4 +---
>  passt.c      |  2 +-
>  passt.h      | 34 ----------------------------------
>  pasta.c      |  7 +++----
>  repair.c     | 14 +++++---------
>  tap.c        | 13 ++++---------
>  tcp.c        |  2 +-
>  tcp_splice.c |  2 +-
>  udp.c        |  2 +-
>  udp_flow.c   |  1 +
>  util.c       | 22 +++-------------------
>  util.h       |  4 +++-
>  vhost_user.c |  8 ++------
>  vu_common.c  |  2 +-
>  17 files changed, 134 insertions(+), 101 deletions(-)
>  create mode 100644 epoll_ctl.c
>  create mode 100644 epoll_ctl.h
> 
> diff --git a/Makefile b/Makefile
> index 3328f8324140..91e037b8fd3c 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -37,23 +37,23 @@ FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
>  FLAGS += -DVERSION=\"$(VERSION)\"
>  FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  
> -PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> -	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> -	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
> -	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
> -	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
> +PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c epoll_ctl.c \
> +	flow.c fwd.c icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c \
> +	log.c mld.c ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c \
> +	pif.c repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c \
> +	udp_flow.c udp_vu.c util.c vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  PASST_REPAIR_SRCS = passt-repair.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
>  
>  MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
>  
> -PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> -	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> -	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> -	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
> -	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
> -	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
> +PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h epoll_ctl.h \
> +	flow.h fwd.h flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h \
> +	isolation.h lineread.h log.h migrate.h ndp.h netlink.h packet.h \
> +	passt.h pasta.h pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h \
> +	tcp_conn.h tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h \
> +	udp_internal.h udp_vu.h util.h vhost_user.h virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> diff --git a/epoll_ctl.c b/epoll_ctl.c
> new file mode 100644
> index 000000000000..7a520560aeb9
> --- /dev/null
> +++ b/epoll_ctl.c
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* epoll_ctl.c - epoll manipulation helpers
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#include <errno.h>
> +
> +#include "epoll_ctl.h"
> +
> +/**
> + * epoll_add() - Add a file descriptor to an epollfd
> + * @epollfd:	epoll file descriptor to add to
> + * @events:	epoll events
> + * @ref:	epoll reference for the file descriptor (includes fd and metadata)
> + *
> + * Return: 0 on success, negative errno on failure
> + */
> +int epoll_add(int epollfd, uint32_t events, union epoll_ref *ref)
> +{
> +	struct epoll_event ev;
> +	int ret;
> +
> +	ev.events = events;
> +	ev.data.u64 = ref->u64;
> +
> +	ret = epoll_ctl(epollfd, EPOLL_CTL_ADD, ref->fd, &ev);
> +	if (ret == -1) {
> +		ret = -errno;
> +		err("Failed to add fd to epoll: %s", strerror_(-ret));
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * epoll_del() - Remove a file descriptor from an epollfd
> + * @epollfd:	epoll file descriptor to remove from
> + * @fd:		File descriptor to remove
> + */
> +void epoll_del(int epollfd, int fd)
> +{
> +	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
> +}
> diff --git a/epoll_ctl.h b/epoll_ctl.h
> new file mode 100644
> index 000000000000..cf92b0f63f26
> --- /dev/null
> +++ b/epoll_ctl.h
> @@ -0,0 +1,51 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef EPOLL_CTL_H
> +#define EPOLL_CTL_H
> +
> +#include <sys/epoll.h>
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "epoll_type.h"
> +#include "flow.h"
> +#include "tcp.h"
> +#include "udp.h"
> +
> +/**
> + * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
> + * @type:	Type of fd (tells us what to do with events)
> + * @fd:		File descriptor number (implies < 2^24 total descriptors)
> + * @flow:	Index of the flow this fd is linked to
> + * @tcp_listen:	TCP-specific reference part for listening sockets
> + * @udp:	UDP-specific reference part
> + * @data:	Data handled by protocol handlers
> + * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
> + * @queue:	vhost-user queue index for this fd
> + * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
> + */
> +union epoll_ref {
> +	struct {
> +		enum epoll_type type:8;
> +		int32_t		  fd:FD_REF_BITS;

Same comment as for v3, quoting:

---
Do the definitions of FD_REF_BITS and FD_REF_MAX really belong to
util.h? It would be clearer to have them here. I didn't check all the
possible places where you would need to include this header though.
---

> +		union {
> +			uint32_t flow;
> +			flow_sidx_t flowside;
> +			union tcp_listen_epoll_ref tcp_listen;
> +			union udp_listen_epoll_ref udp;
> +			uint32_t data;
> +			int nsdir_fd;
> +			int queue;
> +		};
> +	};
> +	uint64_t u64;
> +};
> +static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
> +	      "epoll_ref must have same size as epoll_data");

And for this one, I just realised, after this exchange with David:

---
> Either the comment is misleading, or it should be sizeof(...) != ...  

I believe <= is correct, but the comment is misleading.  What we care
about here is that our custom structure/union fits in space available
in the epoll_event structure.  In practice it will have equal size,
but <= is what we care about.
---

that you're just moving it around in this patch, so, okay, fine.

> +int epoll_add(int epollfd, uint32_t events, union epoll_ref *ref);
> +void epoll_del(int epollfd, int fd);
> +#endif /* EPOLL_CTL_H */
> diff --git a/icmp.c b/icmp.c
> index bd3108a21675..c26561da80bf 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -15,7 +15,6 @@
>  #include <errno.h>
>  #include <net/ethernet.h>
>  #include <net/if.h>
> -#include <netinet/in.h>
>  #include <netinet/ip.h>
>  #include <netinet/ip_icmp.h>
>  #include <stdio.h>
> @@ -23,10 +22,8 @@
>  #include <stdint.h>
>  #include <stddef.h>
>  #include <string.h>
> -#include <sys/epoll.h>
>  #include <sys/types.h>
>  #include <sys/socket.h>
> -#include <unistd.h>
>  #include <time.h>
>  
>  #include <linux/icmpv6.h>
> @@ -41,6 +38,7 @@
>  #include "inany.h"
>  #include "icmp.h"
>  #include "flow_table.h"
> +#include "epoll_ctl.h"
>  
>  #define ICMP_ECHO_TIMEOUT	60 /* s, timeout for ICMP socket activity */
>  #define ICMP_NUM_IDS		(1U << 16)
> diff --git a/passt.c b/passt.c
> index bdb7b6935f0c..af928111786b 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -19,7 +19,6 @@
>   * created in a separate network namespace).
>   */
>  
> -#include <sys/epoll.h>
>  #include <fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/resource.h>
> @@ -53,6 +52,7 @@
>  #include "vu_common.h"
>  #include "migrate.h"
>  #include "repair.h"
> +#include "epoll_ctl.h"
>  
>  #define NUM_EPOLL_EVENTS	8
>  
> diff --git a/passt.h b/passt.h
> index 0075eb4b3b16..befe56bb167b 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -35,40 +35,6 @@ union epoll_ref;
>  #define MAC_OUR_LAA	\
>  	((uint8_t [ETH_ALEN]){0x9a, 0x55, 0x9a, 0x55, 0x9a, 0x55})
>  
> -/**
> - * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
> - * @type:	Type of fd (tells us what to do with events)
> - * @fd:		File descriptor number (implies < 2^24 total descriptors)
> - * @flow:	Index of the flow this fd is linked to
> - * @tcp_listen:	TCP-specific reference part for listening sockets
> - * @udp:	UDP-specific reference part
> - * @icmp:	ICMP-specific reference part
> - * @data:	Data handled by protocol handlers
> - * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
> - * @queue:	vhost-user queue index for this fd
> - * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
> - */
> -union epoll_ref {
> -	struct {
> -		enum epoll_type type:8;
> -#define FD_REF_BITS		24
> -#define FD_REF_MAX		((int)MAX_FROM_BITS(FD_REF_BITS))
> -		int32_t		fd:FD_REF_BITS;
> -		union {
> -			uint32_t flow;
> -			flow_sidx_t flowside;
> -			union tcp_listen_epoll_ref tcp_listen;
> -			union udp_listen_epoll_ref udp;
> -			uint32_t data;
> -			int nsdir_fd;
> -			int queue;
> -		};
> -	};
> -	uint64_t u64;
> -};
> -static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
> -	      "epoll_ref must have same size as epoll_data");
> -
>  /* Large enough for ~128 maximum size frames */
>  #define PKT_BUF_BYTES		(8UL << 20)
>  
> diff --git a/pasta.c b/pasta.c
> index 687406b6e736..e905f6d33b95 100644
> --- a/pasta.c
> +++ b/pasta.c
> @@ -27,7 +27,6 @@
>  #include <stdint.h>
>  #include <unistd.h>
>  #include <syslog.h>
> -#include <sys/epoll.h>
>  #include <sys/inotify.h>
>  #include <sys/mount.h>
>  #include <sys/timerfd.h>
> @@ -49,6 +48,7 @@
>  #include "isolation.h"
>  #include "netlink.h"
>  #include "log.h"
> +#include "epoll_ctl.h"
>  
>  #define HOSTNAME_PREFIX		"pasta-"
>  
> @@ -444,7 +444,6 @@ static int pasta_netns_quit_timer(void)
>   */
>  void pasta_netns_quit_init(const struct ctx *c)
>  {
> -	struct epoll_event ev = { .events = EPOLLIN };
>  	int flags = O_NONBLOCK | O_CLOEXEC;
>  	struct statfs s = { 0 };
>  	bool try_inotify = true;
> @@ -487,8 +486,8 @@ void pasta_netns_quit_init(const struct ctx *c)
>  		die("netns monitor file number %i too big, exiting", fd);
>  
>  	ref.fd = fd;
> -	ev.data.u64 = ref.u64;
> -	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev);
> +
> +	epoll_add(c->epollfd, EPOLLIN, &ref);
>  }
>  
>  /**
> diff --git a/repair.c b/repair.c
> index f6b1bf36479c..c8f4737fa62a 100644
> --- a/repair.c
> +++ b/repair.c
> @@ -22,6 +22,7 @@
>  #include "inany.h"
>  #include "flow.h"
>  #include "flow_table.h"
> +#include "epoll_ctl.h"
>  
>  #include "repair.h"
>  
> @@ -47,7 +48,6 @@ static int repair_nfds;
>  void repair_sock_init(const struct ctx *c)
>  {
>  	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
> -	struct epoll_event ev = { 0 };
>  
>  	if (c->fd_repair_listen == -1)
>  		return;
> @@ -58,9 +58,7 @@ void repair_sock_init(const struct ctx *c)
>  	}
>  
>  	ref.fd = c->fd_repair_listen;
> -	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
> -	ev.data.u64 = ref.u64;
> -	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
> +	if (epoll_add(c->epollfd, EPOLLIN | EPOLLHUP | EPOLLET, &ref))
>  		err_perror("repair helper socket epoll_ctl(), won't migrate");
>  }
>  
> @@ -74,7 +72,6 @@ void repair_sock_init(const struct ctx *c)
>  int repair_listen_handler(struct ctx *c, uint32_t events)
>  {
>  	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
> -	struct epoll_event ev = { 0 };
>  	struct ucred ucred;
>  	socklen_t len;
>  	int rc;
> @@ -112,10 +109,9 @@ int repair_listen_handler(struct ctx *c, uint32_t events)
>  		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
>  
>  	ref.fd = c->fd_repair;
> -	ev.events = EPOLLHUP | EPOLLET;
> -	ev.data.u64 = ref.u64;
> -	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
> -		rc = errno;
> +
> +	rc = epoll_add(c->epollfd, EPOLLHUP | EPOLLET, &ref);
> +	if (rc < 0) {
>  		debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
>  		close(c->fd_repair);
>  		c->fd_repair = -1;
> diff --git a/tap.c b/tap.c
> index 9812f120d426..5b9403dce25b 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -26,7 +26,6 @@
>  #include <netinet/in.h>
>  #include <arpa/inet.h>
>  #include <stdint.h>
> -#include <sys/epoll.h>
>  #include <sys/ioctl.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
> @@ -61,6 +60,7 @@
>  #include "log.h"
>  #include "vhost_user.h"
>  #include "vu_common.h"
> +#include "epoll_ctl.h"
>  
>  /* Maximum allowed frame lengths (including L2 header) */
>  
> @@ -1327,14 +1327,12 @@ static void tap_backend_show_hints(struct ctx *c)
>  static void tap_sock_unix_init(const struct ctx *c)
>  {
>  	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
> -	struct epoll_event ev = { 0 };
>  
>  	listen(c->fd_tap_listen, 0);
>  
>  	ref.fd = c->fd_tap_listen;
> -	ev.events = EPOLLIN | EPOLLET;
> -	ev.data.u64 = ref.u64;
> -	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
> +
> +	epoll_add(c->epollfd, EPOLLIN | EPOLLET, &ref);
>  }
>  
>  /**
> @@ -1343,7 +1341,6 @@ static void tap_sock_unix_init(const struct ctx *c)
>   */
>  static void tap_start_connection(const struct ctx *c)
>  {
> -	struct epoll_event ev = { 0 };
>  	union epoll_ref ref = { 0 };
>  
>  	ref.fd = c->fd_tap;
> @@ -1359,9 +1356,7 @@ static void tap_start_connection(const struct ctx *c)
>  		break;
>  	}
>  
> -	ev.events = EPOLLIN | EPOLLRDHUP;
> -	ev.data.u64 = ref.u64;
> -	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> +	epoll_add(c->epollfd, EPOLLIN | EPOLLRDHUP, &ref);
>  
>  	if (c->ifi4)
>  		arp_send_init_req(c);
> diff --git a/tcp.c b/tcp.c
> index 745353f782f5..db9f17c0622f 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -279,7 +279,6 @@
>  #include <stdbool.h>
>  #include <stddef.h>
>  #include <string.h>
> -#include <sys/epoll.h>
>  #include <sys/ioctl.h>
>  #include <sys/socket.h>
>  #include <sys/timerfd.h>
> @@ -309,6 +308,7 @@
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
>  #include "tcp_vu.h"
> +#include "epoll_ctl.h"
>  
>  /*
>   * The size of TCP header (including options) is given by doff (Data Offset)
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 666ee62b738f..6f21184bdc55 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -44,7 +44,6 @@
>  #include <net/ethernet.h>
>  #include <netinet/in.h>
>  #include <netinet/tcp.h>
> -#include <sys/epoll.h>
>  #include <sys/types.h>
>  #include <sys/socket.h>
>  
> @@ -56,6 +55,7 @@
>  #include "siphash.h"
>  #include "inany.h"
>  #include "flow.h"
> +#include "epoll_ctl.h"
>  
>  #include "flow_table.h"
>  
> diff --git a/udp.c b/udp.c
> index 86585b7e0942..3812d5c2336f 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -94,7 +94,6 @@
>  #include <stdint.h>
>  #include <stddef.h>
>  #include <string.h>
> -#include <sys/epoll.h>
>  #include <sys/types.h>
>  #include <sys/socket.h>
>  #include <sys/uio.h>
> @@ -115,6 +114,7 @@
>  #include "flow_table.h"
>  #include "udp_internal.h"
>  #include "udp_vu.h"
> +#include "epoll_ctl.h"
>  
>  #define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
>  
> diff --git a/udp_flow.c b/udp_flow.c
> index 84973f807167..d9c75f1bb1d8 100644
> --- a/udp_flow.c
> +++ b/udp_flow.c
> @@ -15,6 +15,7 @@
>  #include "passt.h"
>  #include "flow_table.h"
>  #include "udp_internal.h"
> +#include "epoll_ctl.h"
>  
>  #define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
>  
> diff --git a/util.c b/util.c
> index 1067486be414..b2490123590a 100644
> --- a/util.c
> +++ b/util.c
> @@ -18,7 +18,6 @@
>  #include <unistd.h>
>  #include <arpa/inet.h>
>  #include <net/ethernet.h>
> -#include <sys/epoll.h>
>  #include <sys/uio.h>
>  #include <fcntl.h>
>  #include <string.h>
> @@ -35,6 +34,7 @@
>  #include "packet.h"
>  #include "log.h"
>  #include "pcap.h"
> +#include "epoll_ctl.h"
>  #ifdef HAS_GETRANDOM
>  #include <sys/random.h>
>  #endif
> @@ -58,7 +58,6 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  	sa_family_t af = ((const struct sockaddr *)sa)->sa_family;
>  	union epoll_ref ref = { .type = type, .data = data };
>  	bool freebind = false;
> -	struct epoll_event ev;
>  	int fd, y = 1, ret;
>  	uint8_t proto;
>  	int socktype;
> @@ -172,13 +171,9 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  		return ret;
>  	}
>  
> -	ev.events = EPOLLIN;
> -	ev.data.u64 = ref.u64;
> -	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
> -		ret = -errno;
> -		warn("L4 epoll_ctl: %s", strerror_(-ret));
> +	ret = epoll_add(c->epollfd, EPOLLIN, &ref);
> +	if (ret < 0)
>  		return ret;
> -	}
>  
>  	return fd;
>  }
> @@ -994,17 +989,6 @@ void raw_random(void *buf, size_t buflen)
>  		die("Unexpected EOF on random data source");
>  }
>  
> -/**
> - * epoll_del() - Remove a file descriptor from our passt epoll
> - * @epollfd:	epoll file descriptor to remove from
> - * @fd:		File descriptor to remove
> - */
> -void epoll_del(int epollfd, int fd)
> -{
> -	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
> -
> -}
> -
>  /**
>   * encode_domain_name() - Encode domain name according to RFC 1035, section 3.1
>   * @buf:		Buffer to fill in with encoded domain name
> diff --git a/util.h b/util.h
> index c61cbef357aa..8e4b4c5c6032 100644
> --- a/util.h
> +++ b/util.h
> @@ -193,6 +193,9 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  #define SNDBUF_BIG		(4ULL * 1024 * 1024)
>  #define SNDBUF_SMALL		(128ULL * 1024)
>  
> +#define FD_REF_BITS		24
> +#define FD_REF_MAX		((int)MAX_FROM_BITS(FD_REF_BITS))
> +
>  #include <net/if.h>
>  #include <limits.h>
>  #include <stdint.h>
> @@ -300,7 +303,6 @@ static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
>  #define FPRINTF(f, ...)	(void)fprintf(f, __VA_ARGS__)
>  
>  void raw_random(void *buf, size_t buflen);
> -void epoll_del(int epollfd, int fd);
>  
>  /*
>   * Starting from glibc 2.40.9000 and commit 25a5eb4010df ("string: strerror,
> diff --git a/vhost_user.c b/vhost_user.c
> index f8324c59cc6c..aea1e2cbcea5 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -32,8 +32,6 @@
>  #include <inttypes.h>
>  #include <time.h>
>  #include <net/ethernet.h>
> -#include <netinet/in.h>
> -#include <sys/epoll.h>
>  #include <sys/eventfd.h>
>  #include <sys/mman.h>
>  #include <linux/vhost_types.h>
> @@ -45,6 +43,7 @@
>  #include "vhost_user.h"
>  #include "pcap.h"
>  #include "migrate.h"
> +#include "epoll_ctl.h"
>  
>  /* vhost-user version we are compatible with */
>  #define VHOST_USER_VERSION 1
> @@ -753,11 +752,8 @@ static void vu_set_watch(const struct vu_dev *vdev, int idx)
>  		.fd = vdev->vq[idx].kick_fd,
>  		.queue = idx
>  	 };
> -	struct epoll_event ev = { 0 };
>  
> -	ev.data.u64 = ref.u64;
> -	ev.events = EPOLLIN;
> -	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
> +	epoll_add(vdev->context->epollfd, EPOLLIN, &ref);
>  }
>  
>  /**
> diff --git a/vu_common.c b/vu_common.c
> index b716070ea3c3..b13b7c308fd8 100644
> --- a/vu_common.c
> +++ b/vu_common.c
> @@ -6,7 +6,6 @@
>   */
>  
>  #include <errno.h>
> -#include <unistd.h>
>  #include <sys/uio.h>
>  #include <sys/eventfd.h>
>  #include <netinet/if_ether.h>
> @@ -19,6 +18,7 @@
>  #include "pcap.h"
>  #include "vu_common.h"
>  #include "migrate.h"
> +#include "epoll_ctl.h"
>  
>  #define VU_MAX_TX_BUFFER_NB	2
>  

-- 
Stefano


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/7] epoll_ctl: Extract epoll operations
  2025-10-17 11:48   ` Stefano Brivio
@ 2025-10-17 12:21     ` Laurent Vivier
  2025-10-17 13:05       ` Stefano Brivio
  0 siblings, 1 reply; 27+ messages in thread
From: Laurent Vivier @ 2025-10-17 12:21 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On 17/10/2025 13:48, Stefano Brivio wrote:
> I was reviewing v3 but I can seamlessly move on to v4, just two things
> here:
> 
> On Fri, 17 Oct 2025 12:31:24 +0200
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
>> Centralize epoll_add() and epoll_del() helper functions into new
>> epoll_ctl.c/h files.
>>
>> This also moves the union epoll_ref definition from passt.h to
>> epoll_ctl.h where it's more logically placed.
>>
>> The new epoll_add() helper simplifies adding file descriptors to epoll
>> by taking an epoll_ref and events, handling error reporting
>> consistently across all call sites.
>>
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>> ---
>>   Makefile     | 22 +++++++++++-----------
>>   epoll_ctl.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   epoll_ctl.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   icmp.c       |  4 +---
>>   passt.c      |  2 +-
>>   passt.h      | 34 ----------------------------------
>>   pasta.c      |  7 +++----
>>   repair.c     | 14 +++++---------
>>   tap.c        | 13 ++++---------
>>   tcp.c        |  2 +-
>>   tcp_splice.c |  2 +-
>>   udp.c        |  2 +-
>>   udp_flow.c   |  1 +
>>   util.c       | 22 +++-------------------
>>   util.h       |  4 +++-
>>   vhost_user.c |  8 ++------
>>   vu_common.c  |  2 +-
>>   17 files changed, 134 insertions(+), 101 deletions(-)
>>   create mode 100644 epoll_ctl.c
>>   create mode 100644 epoll_ctl.h
>>
>> diff --git a/Makefile b/Makefile
>> index 3328f8324140..91e037b8fd3c 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -37,23 +37,23 @@ FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
>>   FLAGS += -DVERSION=\"$(VERSION)\"
>>   FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>>   
>> -PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>> -	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>> -	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
>> -	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
>> -	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
>> +PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c epoll_ctl.c \
>> +	flow.c fwd.c icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c \
>> +	log.c mld.c ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c \
>> +	pif.c repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c \
>> +	udp_flow.c udp_vu.c util.c vhost_user.c virtio.c vu_common.c
>>   QRAP_SRCS = qrap.c
>>   PASST_REPAIR_SRCS = passt-repair.c
>>   SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
>>   
>>   MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
>>   
>> -PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>> -	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>> -	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
>> -	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
>> -	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
>> -	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
>> +PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h epoll_ctl.h \
>> +	flow.h fwd.h flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h \
>> +	isolation.h lineread.h log.h migrate.h ndp.h netlink.h packet.h \
>> +	passt.h pasta.h pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h \
>> +	tcp_conn.h tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h \
>> +	udp_internal.h udp_vu.h util.h vhost_user.h virtio.h vu_common.h
>>   HEADERS = $(PASST_HEADERS) seccomp.h
>>   
>>   C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
>> diff --git a/epoll_ctl.c b/epoll_ctl.c
>> new file mode 100644
>> index 000000000000..7a520560aeb9
>> --- /dev/null
>> +++ b/epoll_ctl.c
>> @@ -0,0 +1,45 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/* epoll_ctl.c - epoll manipulation helpers
>> + *
>> + * Copyright Red Hat
>> + * Author: Laurent Vivier <lvivier@redhat.com>
>> + */
>> +
>> +#include <errno.h>
>> +
>> +#include "epoll_ctl.h"
>> +
>> +/**
>> + * epoll_add() - Add a file descriptor to an epollfd
>> + * @epollfd:	epoll file descriptor to add to
>> + * @events:	epoll events
>> + * @ref:	epoll reference for the file descriptor (includes fd and metadata)
>> + *
>> + * Return: 0 on success, negative errno on failure
>> + */
>> +int epoll_add(int epollfd, uint32_t events, union epoll_ref *ref)
>> +{
>> +	struct epoll_event ev;
>> +	int ret;
>> +
>> +	ev.events = events;
>> +	ev.data.u64 = ref->u64;
>> +
>> +	ret = epoll_ctl(epollfd, EPOLL_CTL_ADD, ref->fd, &ev);
>> +	if (ret == -1) {
>> +		ret = -errno;
>> +		err("Failed to add fd to epoll: %s", strerror_(-ret));
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +/**
>> + * epoll_del() - Remove a file descriptor from an epollfd
>> + * @epollfd:	epoll file descriptor to remove from
>> + * @fd:		File descriptor to remove
>> + */
>> +void epoll_del(int epollfd, int fd)
>> +{
>> +	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
>> +}
>> diff --git a/epoll_ctl.h b/epoll_ctl.h
>> new file mode 100644
>> index 000000000000..cf92b0f63f26
>> --- /dev/null
>> +++ b/epoll_ctl.h
>> @@ -0,0 +1,51 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later
>> + * Copyright Red Hat
>> + * Author: Laurent Vivier <lvivier@redhat.com>
>> + */
>> +
>> +#ifndef EPOLL_CTL_H
>> +#define EPOLL_CTL_H
>> +
>> +#include <sys/epoll.h>
>> +
>> +#include "util.h"
>> +#include "passt.h"
>> +#include "epoll_type.h"
>> +#include "flow.h"
>> +#include "tcp.h"
>> +#include "udp.h"
>> +
>> +/**
>> + * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
>> + * @type:	Type of fd (tells us what to do with events)
>> + * @fd:		File descriptor number (implies < 2^24 total descriptors)
>> + * @flow:	Index of the flow this fd is linked to
>> + * @tcp_listen:	TCP-specific reference part for listening sockets
>> + * @udp:	UDP-specific reference part
>> + * @data:	Data handled by protocol handlers
>> + * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
>> + * @queue:	vhost-user queue index for this fd
>> + * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
>> + */
>> +union epoll_ref {
>> +	struct {
>> +		enum epoll_type type:8;
>> +		int32_t		  fd:FD_REF_BITS;
> 
> Same comment as for v3, quoting:
> 
> ---
> Do the definitions of FD_REF_BITS and FD_REF_MAX really belong to
> util.h? It would be clearer to have them here. I didn't check all the
> possible places where you would need to include this header though.
> ---
> 

I didn't move them from util.h because it's not only used by epoll but also by 
tcp_tap_conn. So for me it was a generic definition and that fits with util.h.

But I can move it if you prefer.

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/7] epoll_ctl: Extract epoll operations
  2025-10-17 12:21     ` Laurent Vivier
@ 2025-10-17 13:05       ` Stefano Brivio
  0 siblings, 0 replies; 27+ messages in thread
From: Stefano Brivio @ 2025-10-17 13:05 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, 17 Oct 2025 14:21:35 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> On 17/10/2025 13:48, Stefano Brivio wrote:
> > I was reviewing v3 but I can seamlessly move on to v4, just two things
> > here:
> > 
> > On Fri, 17 Oct 2025 12:31:24 +0200
> > Laurent Vivier <lvivier@redhat.com> wrote:
> >   
> >> Centralize epoll_add() and epoll_del() helper functions into new
> >> epoll_ctl.c/h files.
> >>
> >> This also moves the union epoll_ref definition from passt.h to
> >> epoll_ctl.h where it's more logically placed.
> >>
> >> The new epoll_add() helper simplifies adding file descriptors to epoll
> >> by taking an epoll_ref and events, handling error reporting
> >> consistently across all call sites.
> >>
> >> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> >> ---
> >>   Makefile     | 22 +++++++++++-----------
> >>   epoll_ctl.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >>   epoll_ctl.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   icmp.c       |  4 +---
> >>   passt.c      |  2 +-
> >>   passt.h      | 34 ----------------------------------
> >>   pasta.c      |  7 +++----
> >>   repair.c     | 14 +++++---------
> >>   tap.c        | 13 ++++---------
> >>   tcp.c        |  2 +-
> >>   tcp_splice.c |  2 +-
> >>   udp.c        |  2 +-
> >>   udp_flow.c   |  1 +
> >>   util.c       | 22 +++-------------------
> >>   util.h       |  4 +++-
> >>   vhost_user.c |  8 ++------
> >>   vu_common.c  |  2 +-
> >>   17 files changed, 134 insertions(+), 101 deletions(-)
> >>   create mode 100644 epoll_ctl.c
> >>   create mode 100644 epoll_ctl.h
> >>
> >> diff --git a/Makefile b/Makefile
> >> index 3328f8324140..91e037b8fd3c 100644
> >> --- a/Makefile
> >> +++ b/Makefile
> >> @@ -37,23 +37,23 @@ FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
> >>   FLAGS += -DVERSION=\"$(VERSION)\"
> >>   FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> >>   
> >> -PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> >> -	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> >> -	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
> >> -	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
> >> -	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
> >> +PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c epoll_ctl.c \
> >> +	flow.c fwd.c icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c \
> >> +	log.c mld.c ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c \
> >> +	pif.c repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c \
> >> +	udp_flow.c udp_vu.c util.c vhost_user.c virtio.c vu_common.c
> >>   QRAP_SRCS = qrap.c
> >>   PASST_REPAIR_SRCS = passt-repair.c
> >>   SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> >>   
> >>   MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
> >>   
> >> -PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> >> -	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> >> -	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> >> -	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
> >> -	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
> >> -	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
> >> +PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h epoll_ctl.h \
> >> +	flow.h fwd.h flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h \
> >> +	isolation.h lineread.h log.h migrate.h ndp.h netlink.h packet.h \
> >> +	passt.h pasta.h pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h \
> >> +	tcp_conn.h tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h \
> >> +	udp_internal.h udp_vu.h util.h vhost_user.h virtio.h vu_common.h
> >>   HEADERS = $(PASST_HEADERS) seccomp.h
> >>   
> >>   C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> >> diff --git a/epoll_ctl.c b/epoll_ctl.c
> >> new file mode 100644
> >> index 000000000000..7a520560aeb9
> >> --- /dev/null
> >> +++ b/epoll_ctl.c
> >> @@ -0,0 +1,45 @@
> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> +/* epoll_ctl.c - epoll manipulation helpers
> >> + *
> >> + * Copyright Red Hat
> >> + * Author: Laurent Vivier <lvivier@redhat.com>
> >> + */
> >> +
> >> +#include <errno.h>
> >> +
> >> +#include "epoll_ctl.h"
> >> +
> >> +/**
> >> + * epoll_add() - Add a file descriptor to an epollfd
> >> + * @epollfd:	epoll file descriptor to add to
> >> + * @events:	epoll events
> >> + * @ref:	epoll reference for the file descriptor (includes fd and metadata)
> >> + *
> >> + * Return: 0 on success, negative errno on failure
> >> + */
> >> +int epoll_add(int epollfd, uint32_t events, union epoll_ref *ref)
> >> +{
> >> +	struct epoll_event ev;
> >> +	int ret;
> >> +
> >> +	ev.events = events;
> >> +	ev.data.u64 = ref->u64;
> >> +
> >> +	ret = epoll_ctl(epollfd, EPOLL_CTL_ADD, ref->fd, &ev);
> >> +	if (ret == -1) {
> >> +		ret = -errno;
> >> +		err("Failed to add fd to epoll: %s", strerror_(-ret));
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +/**
> >> + * epoll_del() - Remove a file descriptor from an epollfd
> >> + * @epollfd:	epoll file descriptor to remove from
> >> + * @fd:		File descriptor to remove
> >> + */
> >> +void epoll_del(int epollfd, int fd)
> >> +{
> >> +	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
> >> +}
> >> diff --git a/epoll_ctl.h b/epoll_ctl.h
> >> new file mode 100644
> >> index 000000000000..cf92b0f63f26
> >> --- /dev/null
> >> +++ b/epoll_ctl.h
> >> @@ -0,0 +1,51 @@
> >> +/* SPDX-License-Identifier: GPL-2.0-or-later
> >> + * Copyright Red Hat
> >> + * Author: Laurent Vivier <lvivier@redhat.com>
> >> + */
> >> +
> >> +#ifndef EPOLL_CTL_H
> >> +#define EPOLL_CTL_H
> >> +
> >> +#include <sys/epoll.h>
> >> +
> >> +#include "util.h"
> >> +#include "passt.h"
> >> +#include "epoll_type.h"
> >> +#include "flow.h"
> >> +#include "tcp.h"
> >> +#include "udp.h"
> >> +
> >> +/**
> >> + * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
> >> + * @type:	Type of fd (tells us what to do with events)
> >> + * @fd:		File descriptor number (implies < 2^24 total descriptors)
> >> + * @flow:	Index of the flow this fd is linked to
> >> + * @tcp_listen:	TCP-specific reference part for listening sockets
> >> + * @udp:	UDP-specific reference part
> >> + * @data:	Data handled by protocol handlers
> >> + * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
> >> + * @queue:	vhost-user queue index for this fd
> >> + * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
> >> + */
> >> +union epoll_ref {
> >> +	struct {
> >> +		enum epoll_type type:8;
> >> +		int32_t		  fd:FD_REF_BITS;  
> > 
> > Same comment as for v3, quoting:
> > 
> > ---
> > Do the definitions of FD_REF_BITS and FD_REF_MAX really belong to
> > util.h? It would be clearer to have them here. I didn't check all the
> > possible places where you would need to include this header though.
> > ---
> >   
> 
> I didn't move them from util.h because it's not only used by epoll but also by 
> tcp_tap_conn. So for me it was a generic definition and that fits with util.h.

Oh, I see your point now, and we limit the number of bits we use in the
two structures because we want to save space in both, separately, so
also the reason for the limit is somehow generic.

> But I can move it if you prefer.

No no, sorry, never mind, I don't know why I missed that.

-- 
Stefano


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common
  2025-10-17 10:31 ` [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common Laurent Vivier
@ 2025-10-17 17:43   ` Stefano Brivio
  2025-10-21 13:13     ` Laurent Vivier
  2025-10-20  1:34   ` David Gibson
  1 sibling, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2025-10-17 17:43 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev, David Gibson

Nits only:

On Fri, 17 Oct 2025 12:31:26 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
> whether a connection was registered with epoll, not which epoll instance.
> This limited flexibility for future multi-epoll support.
> 
> Replace the boolean with a threadnb field in flow_common that identifies
> which thread (and thus which epoll instance) the flow is registered with.
> Use FLOW_THREADNB_INVALID to indicate when a flow is not registered with
> any epoll instance. A threadnb_to_epollfd[] mapping table translates
> thread numbers to their corresponding epoll file descriptors.

It wasn't really clear to me until I actually started digging into this
change what NB meant there. For me it's "notifier block" before
"number", but that didn't make sense either.

Some ideas (even though maybe as David suggested this name
shouldn't have "thread" in it at all):

- f->thread / FLOW_THREAD_INVALID

- f->thread_id / FLOW_THREAD_ID_INVALID

...or f->epoll_id, reflecting David's observation?

> Add helper functions:
> - flow_in_epoll() to check if a flow is registered with epoll
> - flow_epollfd() to retrieve the epoll fd for a flow's thread
> - flow_thread_register() to register an epoll fd with a thread
> - flow_thread_set() to set the thread number of a flow
> 
> This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
> the need to pass the context 'c', since the epoll fd is now directly
> accessible from the flow structure via flow_epollfd().
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  flow.c       | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  flow.h       | 12 ++++++++++++
>  passt.c      |  1 +
>  tcp.c        | 39 ++++++++++++++++++++------------------
>  tcp_conn.h   |  8 +-------
>  tcp_splice.c | 24 ++++++++++++------------
>  6 files changed, 99 insertions(+), 38 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index b14e9d8b63ff..d56bae776239 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -116,6 +116,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>  unsigned flow_first_free;
>  union flow flowtab[FLOW_MAX];
>  static const union flow *flow_new_entry; /* = NULL */

It would be nice to have an idea of how this table is organised without
looking at how it's used, say:

/* Table of epoll file descriptors, indexed by thread number */

...or indexed by "epoll identifiers", perhaps, if we want to drop
references to threads here.

> +static int threadnb_to_epollfd[FLOW_THREADNB_SIZE];
>  
>  /* Hash table to index it */
>  #define FLOW_HASH_LOAD		70		/* % */
> @@ -347,6 +348,55 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
>  	flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
>  }
>  
> +/**
> + * flow_in_epoll() - Check if flow is registered with an epoll instance
> + * @f:		Flow to check
> + *
> + * Return: true if flow is registered with epoll, false otherwise
> + */
> +bool flow_in_epoll(const struct flow_common *f)
> +{
> +	return f->threadnb != FLOW_THREADNB_INVALID;
> +}
> +
> +/**
> + * flow_epollfd() - Get the epoll file descriptor for a flow
> + * @f:		Flow to query
> + *
> + * Return: epoll file descriptor associated with the flow's thread
> + */
> +int flow_epollfd(const struct flow_common *f)
> +{
> +	ASSERT(f->threadnb < FLOW_THREADNB_MAX);
> +
> +	return threadnb_to_epollfd[f->threadnb];
> +}
> +
> +/**
> + * flow_thread_set() - Associate a flow with a thread
> + * @f:		Flow to update
> + * @threadnb:	Thread number to associate with this flow
> + */
> +void flow_thread_set(struct flow_common *f, int threadnb)
> +{
> +	ASSERT(threadnb < FLOW_THREADNB_MAX);
> +
> +	f->threadnb = threadnb;
> +}
> +
> +/**
> + * flow_thread_register() - Initialize the threadnb -> epollfd mapping
> + * @threadnb:	Thread number to associate to
> + * @epollfd:	epoll file descriptor for the thread
> + */
> +void flow_thread_register(int threadnb, int epollfd)
> +{
> +	ASSERT(threadnb < FLOW_THREADNB_MAX);
> +	ASSERT(epollfd >= 0);
> +
> +	threadnb_to_epollfd[threadnb] = epollfd;
> +}
> +
>  /**
>   * flow_initiate_() - Move flow to INI, setting pif[INISIDE]
>   * @flow:	Flow to change state
> @@ -548,6 +598,7 @@ union flow *flow_alloc(void)
>  
>  	flow_new_entry = flow;
>  	memset(flow, 0, sizeof(*flow));
> +	flow->f.threadnb = FLOW_THREADNB_INVALID;
>  	flow_set_state(&flow->f, FLOW_STATE_NEW);
>  
>  	return flow;
> @@ -827,7 +878,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  		case FLOW_TCP_SPLICE:
>  			closed = tcp_splice_flow_defer(&flow->tcp_splice);
>  			if (!closed && timer)
> -				tcp_splice_timer(c, &flow->tcp_splice);
> +				tcp_splice_timer(&flow->tcp_splice);
>  			break;
>  		case FLOW_PING4:
>  		case FLOW_PING6:
> diff --git a/flow.h b/flow.h
> index ef138b83add8..700d8b32c990 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -177,6 +177,8 @@ int flowside_connect(const struct ctx *c, int s,
>   * @type:	Type of packet flow
>   * @pif[]:	Interface for each side of the flow
>   * @side[]:	Information for each side of the flow
> + * @threadnb:	Thread number flow is registered with
> + *		(FLOW_THREADNB_INVALID if not)

You could phrase this more directly, say, "thread identifier, or
FLOW_THREAD_INVALID", or "epollfd identifier, or EPOLLFD_ID_INVALID".

This is a structure describing flows, it's not surprising it's about
the flow.

>   */
>  struct flow_common {
>  #ifdef __GNUC__
> @@ -192,8 +194,14 @@ struct flow_common {
>  #endif
>  	uint8_t		pif[SIDES];
>  	struct flowside	side[SIDES];
> +#define FLOW_THREADNB_BITS 8
> +	unsigned int	threadnb:FLOW_THREADNB_BITS;
>  };
>  
> +#define FLOW_THREADNB_SIZE	(1 << FLOW_THREADNB_BITS)
> +#define FLOW_THREADNB_MAX	(FLOW_THREADNB_SIZE - 1)
> +#define FLOW_THREADNB_INVALID	FLOW_THREADNB_MAX
> +
>  #define FLOW_INDEX_BITS		17	/* 128k - 1 */
>  #define FLOW_MAX		MAX_FROM_BITS(FLOW_INDEX_BITS)
>  
> @@ -249,6 +257,10 @@ flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
>  union flow;
>  
>  void flow_init(void);
> +bool flow_in_epoll(const struct flow_common *f);
> +int flow_epollfd(const struct flow_common *f);
> +void flow_thread_set(struct flow_common *f, int threadnb);
> +void flow_thread_register(int threadnb, int epollfd);
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
>  int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
>  			      int fd);
> diff --git a/passt.c b/passt.c
> index af928111786b..37f2c897be84 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -285,6 +285,7 @@ int main(int argc, char **argv)
>  	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
>  	if (c.epollfd == -1)
>  		die_perror("Failed to create epoll file descriptor");
> +	flow_thread_register(0, c.epollfd);

Temporary, I suppose. If not, maybe it deserves its own constant, such
as EPOLLFD_ID_DEFAULT?

>  	if (getrlimit(RLIMIT_NOFILE, &limit))
>  		die_perror("Failed to get maximum value of open files limit");
> diff --git a/tcp.c b/tcp.c
> index db9f17c0622f..8c49852b8454 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -504,25 +504,27 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
>   */
>  static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
> -	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
> +	int m = flow_in_epoll(&conn->f) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
>  		                .flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), };
>  	struct epoll_event ev = { .data.u64 = ref.u64 };
> +	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
> +						   : c->epollfd;

We usually align the second expression to the ternary operator, see
also example in sock_l4_sa(), say:

	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
					      : c->epollfd;

>  
>  	if (conn->events == CLOSED) {
> -		if (conn->in_epoll)
> -			epoll_del(c->epollfd, conn->sock);
> +		if (flow_in_epoll(&conn->f))
> +			epoll_del(epollfd, conn->sock);
>  		if (conn->timer != -1)
> -			epoll_del(c->epollfd, conn->timer);
> +			epoll_del(epollfd, conn->timer);
>  		return 0;
>  	}
>  
>  	ev.events = tcp_conn_epoll_events(conn->events, conn->flags);
>  
> -	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
> +	if (epoll_ctl(epollfd, m, conn->sock, &ev))
>  		return -errno;
>  
> -	conn->in_epoll = true;
> +	flow_thread_set(&conn->f, 0);

Not temporary I guess, maybe it's an EPOLLFD_ID_DEFAULT?

>  
>  	if (conn->timer != -1) {
>  		union epoll_ref ref_t = { .type = EPOLL_TYPE_TCP_TIMER,
> @@ -531,7 +533,8 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		struct epoll_event ev_t = { .data.u64 = ref_t.u64,
>  					    .events = EPOLLIN | EPOLLET };
>  
> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_MOD, conn->timer, &ev_t))
> +		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_MOD,
> +			      conn->timer, &ev_t))
>  			return -errno;
>  	}
>  
> @@ -540,12 +543,11 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  
>  /**
>   * tcp_timer_ctl() - Set timerfd based on flags/events, create timerfd if needed
> - * @c:		Execution context
>   * @conn:	Connection pointer
>   *
>   * #syscalls timerfd_create timerfd_settime
>   */
> -static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> +static void tcp_timer_ctl(struct tcp_tap_conn *conn)
>  {
>  	struct itimerspec it = { { 0 }, { 0 } };
>  
> @@ -570,7 +572,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		}
>  		conn->timer = fd;
>  
> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
> +		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_ADD,

I wouldn't find it outrageous if you assigned an epollfd local variable
first so that this doesn't need two lines.

> +			      conn->timer, &ev)) {
>  			flow_dbg_perror(conn, "failed to add timer");
>  			close(conn->timer);
>  			conn->timer = -1;
> @@ -628,7 +631,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  			 * flags and factor this into the logic below.
>  			 */
>  			if (flag == ACK_FROM_TAP_DUE)
> -				tcp_timer_ctl(c, conn);
> +				tcp_timer_ctl(conn);
>  
>  			return;
>  		}
> @@ -644,7 +647,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  	if (flag == ACK_FROM_TAP_DUE || flag == ACK_TO_TAP_DUE		  ||
>  	    (flag == ~ACK_FROM_TAP_DUE && (conn->flags & ACK_TO_TAP_DUE)) ||
>  	    (flag == ~ACK_TO_TAP_DUE   && (conn->flags & ACK_FROM_TAP_DUE)))
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  }
>  
>  /**
> @@ -699,7 +702,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp_epoll_ctl(c, conn);
>  
>  	if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  }
>  
>  /**
> @@ -1757,7 +1760,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  				   seq, conn->seq_from_tap);
>  
>  			tcp_send_flag(c, conn, ACK);
> -			tcp_timer_ctl(c, conn);
> +			tcp_timer_ctl(conn);
>  
>  			if (p->count == 1) {
>  				tcp_tap_window_update(c, conn,
> @@ -2406,7 +2409,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>  
>  	if (conn->flags & ACK_TO_TAP_DUE) {
>  		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		if (!(conn->events & ESTABLISHED)) {
>  			flow_dbg(conn, "handshake timeout");
> @@ -2428,7 +2431,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>  				return;
>  
>  			tcp_data_from_sock(c, conn);
> -			tcp_timer_ctl(c, conn);
> +			tcp_timer_ctl(conn);
>  		}
>  	} else {
>  		struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
> @@ -3476,7 +3479,7 @@ int tcp_flow_migrate_source_ext(const struct ctx *c,
>  	if (c->migrate_no_linger)
>  		close(s);
>  	else
> -		epoll_del(c->epollfd, s);
> +		epoll_del(flow_epollfd(&conn->f), s);
>  
>  	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
>  	 * based on the end of the queues.
> @@ -3625,7 +3628,7 @@ static int tcp_flow_repair_connect(const struct ctx *c,
>  		return rc;
>  	}
>  
> -	conn->in_epoll = 0;
> +	conn->f.threadnb = FLOW_THREADNB_INVALID;
>  	conn->timer = -1;
>  	conn->listening_sock = -1;
>  
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 38b5c541f003..81333122d531 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -12,7 +12,6 @@
>  /**
>   * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
>   * @f:			Generic flow information
> - * @in_epoll:		Is the connection in the epoll set?
>   * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
>   * @ws_from_tap:	Window scaling factor advertised from tap/guest
>   * @ws_to_tap:		Window scaling factor advertised to tap/guest
> @@ -36,8 +35,6 @@ struct tcp_tap_conn {
>  	/* Must be first element */
>  	struct flow_common f;
>  
> -	bool		in_epoll	:1;
> -
>  #define TCP_RETRANS_BITS		3
>  	unsigned int	retrans		:TCP_RETRANS_BITS;
>  #define TCP_MAX_RETRANS			MAX_FROM_BITS(TCP_RETRANS_BITS)
> @@ -196,7 +193,6 @@ struct tcp_tap_transfer_ext {
>   * @written:		Bytes written (not fully written from one other side read)
>   * @events:		Events observed/actions performed on connection
>   * @flags:		Connection flags (attributes, not events)
> - * @in_epoll:		Is the connection in the epoll set?
>   */
>  struct tcp_splice_conn {
>  	/* Must be first element */
> @@ -220,8 +216,6 @@ struct tcp_splice_conn {
>  #define RCVLOWAT_SET(sidei_)		((sidei_) ? BIT(1) : BIT(0))
>  #define RCVLOWAT_ACT(sidei_)		((sidei_) ? BIT(3) : BIT(2))
>  #define CLOSING				BIT(4)
> -
> -	bool in_epoll	:1;
>  };
>  
>  /* Socket pools */
> @@ -245,7 +239,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
>  bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
>  
>  bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
> -void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
> +void tcp_splice_timer(struct tcp_splice_conn *conn);
>  int tcp_conn_pool_sock(int pool[]);
>  int tcp_conn_sock(sa_family_t af);
>  int tcp_sock_refill_pool(int pool[], sa_family_t af);
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 6f21184bdc55..703bd7610890 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -149,7 +149,9 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
>  static int tcp_splice_epoll_ctl(const struct ctx *c,
>  				struct tcp_splice_conn *conn)
>  {
> -	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
> +	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
> +						   : c->epollfd;

Same as above.

Given the new, more limited usage of c->epollfd, if it's not a temporary
thing (I couldn't quite guess what your plan is), maybe worth updating
its documentation in struct ctx?

> +	int m = flow_in_epoll(&conn->f) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	const union epoll_ref ref[SIDES] = {
>  		{ .type = EPOLL_TYPE_TCP_SPLICE, .fd = conn->s[0],
>  		  .flowside = FLOW_SIDX(conn, 0) },
> @@ -161,25 +163,24 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
>  
>  	tcp_splice_conn_epoll_events(conn->events, ev);
>  
> -	if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) ||
> -	    epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) {
> +
> +	if (epoll_ctl(epollfd, m, conn->s[0], &ev[0]) ||
> +	    epoll_ctl(epollfd, m, conn->s[1], &ev[1])) {
>  		int ret = -errno;
>  		flow_perror(conn, "ERROR on epoll_ctl()");
>  		return ret;
>  	}
> -
> -	conn->in_epoll = true;
> +	flow_thread_set(&conn->f, 0);
>  
>  	return 0;
>  }
>  
>  /**
>   * conn_flag_do() - Set/unset given flag, log, update epoll on CLOSING flag
> - * @c:		Execution context
>   * @conn:	Connection pointer
>   * @flag:	Flag to set, or ~flag to unset
>   */
> -static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
> +static void conn_flag_do(struct tcp_splice_conn *conn,
>  			 unsigned long flag)
>  {
>  	if (flag & (flag - 1)) {
> @@ -204,15 +205,15 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
>  	}
>  
>  	if (flag == CLOSING) {
> -		epoll_del(c->epollfd, conn->s[0]);
> -		epoll_del(c->epollfd, conn->s[1]);
> +		epoll_del(flow_epollfd(&conn->f), conn->s[0]);
> +		epoll_del(flow_epollfd(&conn->f), conn->s[1]);
>  	}
>  }
>  
>  #define conn_flag(c, conn, flag)					\
>  	do {								\
>  		flow_trace(conn, "flag at %s:%i", __func__, __LINE__);	\
> -		conn_flag_do(c, conn, flag);				\
> +		conn_flag_do(conn, flag);				\
>  	} while (0)
>  
>  /**
> @@ -751,10 +752,9 @@ void tcp_splice_init(struct ctx *c)
>  
>  /**
>   * tcp_splice_timer() - Timer for spliced connections
> - * @c:		Execution context
>   * @conn:	Connection to handle
>   */
> -void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn)
> +void tcp_splice_timer(struct tcp_splice_conn *conn)
>  {
>  	unsigned sidei;
>  

-- 
Stefano


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 7/7] passt: Move main event loop processing into passt_worker()
  2025-10-17 10:31 ` [PATCH v4 7/7] passt: Move main event loop processing into passt_worker() Laurent Vivier
@ 2025-10-17 17:43   ` Stefano Brivio
  2025-10-20  1:43   ` David Gibson
  1 sibling, 0 replies; 27+ messages in thread
From: Stefano Brivio @ 2025-10-17 17:43 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Fri, 17 Oct 2025 12:31:29 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> Extract the epoll event processing logic from main() into a separate
> passt_worker() function. This refactoring prepares the code for future
> threading support where passt_worker() will be called as a worker thread
> callback.
> 
> The new function handles:
> - Processing epoll events and dispatching to protocol handlers
> - Event statistics tracking and printing
> - Post-handler periodic tasks (timers, deferred work)
> - Migration handling
> 
> No functional changes, purely a code restructuring.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  passt.c | 160 +++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 88 insertions(+), 72 deletions(-)
> 
> diff --git a/passt.c b/passt.c
> index 37f2c897be84..5bfa4c6353d9 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -229,6 +229,92 @@ static void print_stats(const struct ctx *c, const struct passt_stats *stats,
>  	lines_printed++;
>  }
>  
> +/**
> + * passt_worker() - Process epoll events and handle protocol operations
> + * @opaque:	Pointer to execution context (struct ctx)
> + * @nfds:	Number of file descriptors ready (epoll_wait return value)
> + * @events:	epoll_event array of ready file descriptors
> + */
> +static void passt_worker(void *opaque, int nfds,  struct epoll_event *events)

Nit: excess whitespace. The rest of the series looks good to me, all
tests pass as well.

-- 
Stefano


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/7] epoll_ctl: Extract epoll operations
  2025-10-17 10:31 ` [PATCH v4 2/7] epoll_ctl: Extract epoll operations Laurent Vivier
  2025-10-17 11:48   ` Stefano Brivio
@ 2025-10-20  1:20   ` David Gibson
  2025-10-21 11:52     ` Laurent Vivier
  1 sibling, 1 reply; 27+ messages in thread
From: David Gibson @ 2025-10-20  1:20 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 20740 bytes --]

On Fri, Oct 17, 2025 at 12:31:24PM +0200, Laurent Vivier wrote:
> Centralize epoll_add() and epoll_del() helper functions into new
> epoll_ctl.c/h files.
> 
> This also moves the union epoll_ref definition from passt.h to
> epoll_ctl.h where it's more logically placed.
> 
> The new epoll_add() helper simplifies adding file descriptors to epoll
> by taking an epoll_ref and events, handling error reporting
> consistently across all call sites.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Concept looks good, some minor details in execution.

> ---
>  Makefile     | 22 +++++++++++-----------
>  epoll_ctl.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  epoll_ctl.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  icmp.c       |  4 +---
>  passt.c      |  2 +-
>  passt.h      | 34 ----------------------------------
>  pasta.c      |  7 +++----
>  repair.c     | 14 +++++---------
>  tap.c        | 13 ++++---------
>  tcp.c        |  2 +-
>  tcp_splice.c |  2 +-
>  udp.c        |  2 +-
>  udp_flow.c   |  1 +
>  util.c       | 22 +++-------------------
>  util.h       |  4 +++-
>  vhost_user.c |  8 ++------
>  vu_common.c  |  2 +-
>  17 files changed, 134 insertions(+), 101 deletions(-)
>  create mode 100644 epoll_ctl.c
>  create mode 100644 epoll_ctl.h
> 
> diff --git a/Makefile b/Makefile
> index 3328f8324140..91e037b8fd3c 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -37,23 +37,23 @@ FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
>  FLAGS += -DVERSION=\"$(VERSION)\"
>  FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  
> -PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> -	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> -	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
> -	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
> -	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
> +PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c epoll_ctl.c \
> +	flow.c fwd.c icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c \
> +	log.c mld.c ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c \
> +	pif.c repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c \
> +	udp_flow.c udp_vu.c util.c vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  PASST_REPAIR_SRCS = passt-repair.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
>  
>  MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
>  
> -PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> -	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> -	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> -	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
> -	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
> -	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
> +PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h epoll_ctl.h \
> +	flow.h fwd.h flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h \
> +	isolation.h lineread.h log.h migrate.h ndp.h netlink.h packet.h \
> +	passt.h pasta.h pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h \
> +	tcp_conn.h tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h \
> +	udp_internal.h udp_vu.h util.h vhost_user.h virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> diff --git a/epoll_ctl.c b/epoll_ctl.c
> new file mode 100644
> index 000000000000..7a520560aeb9
> --- /dev/null
> +++ b/epoll_ctl.c
> @@ -0,0 +1,45 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* epoll_ctl.c - epoll manipulation helpers
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#include <errno.h>
> +
> +#include "epoll_ctl.h"
> +
> +/**
> + * epoll_add() - Add a file descriptor to an epollfd
> + * @epollfd:	epoll file descriptor to add to
> + * @events:	epoll events
> + * @ref:	epoll reference for the file descriptor (includes fd and metadata)
> + *
> + * Return: 0 on success, negative errno on failure
> + */
> +int epoll_add(int epollfd, uint32_t events, union epoll_ref *ref)

This can take ref by value.  It may be a union, but it fits in
64-bits, so it can be passed in a register.  If there is a reason to
keep it a pointer, then it should be a const pointer, since you don't
modify it (I'm slightly surprised cppcheck didn't whinge about it).

> +{
> +	struct epoll_event ev;
> +	int ret;
> +
> +	ev.events = events;
> +	ev.data.u64 = ref->u64;
> +
> +	ret = epoll_ctl(epollfd, EPOLL_CTL_ADD, ref->fd, &ev);
> +	if (ret == -1) {

I generally like to use ret < 0 out of defensiveness, even though
system calls are defined to return exactly -1 on error.

> +		ret = -errno;
> +		err("Failed to add fd to epoll: %s", strerror_(-ret));
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * epoll_del() - Remove a file descriptor from an epollfd
> + * @epollfd:	epoll file descriptor to remove from
> + * @fd:		File descriptor to remove
> + */
> +void epoll_del(int epollfd, int fd)
> +{
> +	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
> +}
> diff --git a/epoll_ctl.h b/epoll_ctl.h
> new file mode 100644
> index 000000000000..cf92b0f63f26
> --- /dev/null
> +++ b/epoll_ctl.h
> @@ -0,0 +1,51 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef EPOLL_CTL_H
> +#define EPOLL_CTL_H
> +
> +#include <sys/epoll.h>
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "epoll_type.h"
> +#include "flow.h"
> +#include "tcp.h"
> +#include "udp.h"
> +
> +/**
> + * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
> + * @type:	Type of fd (tells us what to do with events)
> + * @fd:		File descriptor number (implies < 2^24 total descriptors)
> + * @flow:	Index of the flow this fd is linked to
> + * @tcp_listen:	TCP-specific reference part for listening sockets
> + * @udp:	UDP-specific reference part
> + * @data:	Data handled by protocol handlers
> + * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
> + * @queue:	vhost-user queue index for this fd
> + * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
> + */
> +union epoll_ref {
> +	struct {
> +		enum epoll_type type:8;
> +		int32_t		  fd:FD_REF_BITS;
> +		union {
> +			uint32_t flow;
> +			flow_sidx_t flowside;
> +			union tcp_listen_epoll_ref tcp_listen;
> +			union udp_listen_epoll_ref udp;
> +			uint32_t data;
> +			int nsdir_fd;
> +			int queue;
> +		};
> +	};
> +	uint64_t u64;
> +};
> +static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
> +	      "epoll_ref must have same size as epoll_data");
> +
> +int epoll_add(int epollfd, uint32_t events, union epoll_ref *ref);
> +void epoll_del(int epollfd, int fd);
> +#endif /* EPOLL_CTL_H */
> diff --git a/icmp.c b/icmp.c
> index bd3108a21675..c26561da80bf 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -15,7 +15,6 @@
>  #include <errno.h>
>  #include <net/ethernet.h>
>  #include <net/if.h>
> -#include <netinet/in.h>
>  #include <netinet/ip.h>
>  #include <netinet/ip_icmp.h>
>  #include <stdio.h>
> @@ -23,10 +22,8 @@
>  #include <stdint.h>
>  #include <stddef.h>
>  #include <string.h>
> -#include <sys/epoll.h>
>  #include <sys/types.h>
>  #include <sys/socket.h>
> -#include <unistd.h>
>  #include <time.h>
>  
>  #include <linux/icmpv6.h>
> @@ -41,6 +38,7 @@
>  #include "inany.h"
>  #include "icmp.h"
>  #include "flow_table.h"
> +#include "epoll_ctl.h"
>  
>  #define ICMP_ECHO_TIMEOUT	60 /* s, timeout for ICMP socket activity */
>  #define ICMP_NUM_IDS		(1U << 16)
> diff --git a/passt.c b/passt.c
> index bdb7b6935f0c..af928111786b 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -19,7 +19,6 @@
>   * created in a separate network namespace).
>   */
>  
> -#include <sys/epoll.h>
>  #include <fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/resource.h>
> @@ -53,6 +52,7 @@
>  #include "vu_common.h"
>  #include "migrate.h"
>  #include "repair.h"
> +#include "epoll_ctl.h"
>  
>  #define NUM_EPOLL_EVENTS	8
>  
> diff --git a/passt.h b/passt.h
> index 0075eb4b3b16..befe56bb167b 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -35,40 +35,6 @@ union epoll_ref;
>  #define MAC_OUR_LAA	\
>  	((uint8_t [ETH_ALEN]){0x9a, 0x55, 0x9a, 0x55, 0x9a, 0x55})
>  
> -/**
> - * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
> - * @type:	Type of fd (tells us what to do with events)
> - * @fd:		File descriptor number (implies < 2^24 total descriptors)
> - * @flow:	Index of the flow this fd is linked to
> - * @tcp_listen:	TCP-specific reference part for listening sockets
> - * @udp:	UDP-specific reference part
> - * @icmp:	ICMP-specific reference part
> - * @data:	Data handled by protocol handlers
> - * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
> - * @queue:	vhost-user queue index for this fd
> - * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
> - */
> -union epoll_ref {
> -	struct {
> -		enum epoll_type type:8;
> -#define FD_REF_BITS		24
> -#define FD_REF_MAX		((int)MAX_FROM_BITS(FD_REF_BITS))
> -		int32_t		fd:FD_REF_BITS;
> -		union {
> -			uint32_t flow;
> -			flow_sidx_t flowside;
> -			union tcp_listen_epoll_ref tcp_listen;
> -			union udp_listen_epoll_ref udp;
> -			uint32_t data;
> -			int nsdir_fd;
> -			int queue;
> -		};
> -	};
> -	uint64_t u64;
> -};
> -static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
> -	      "epoll_ref must have same size as epoll_data");
> -
>  /* Large enough for ~128 maximum size frames */
>  #define PKT_BUF_BYTES		(8UL << 20)
>  
> diff --git a/pasta.c b/pasta.c
> index 687406b6e736..e905f6d33b95 100644
> --- a/pasta.c
> +++ b/pasta.c
> @@ -27,7 +27,6 @@
>  #include <stdint.h>
>  #include <unistd.h>
>  #include <syslog.h>
> -#include <sys/epoll.h>
>  #include <sys/inotify.h>
>  #include <sys/mount.h>
>  #include <sys/timerfd.h>
> @@ -49,6 +48,7 @@
>  #include "isolation.h"
>  #include "netlink.h"
>  #include "log.h"
> +#include "epoll_ctl.h"
>  
>  #define HOSTNAME_PREFIX		"pasta-"
>  
> @@ -444,7 +444,6 @@ static int pasta_netns_quit_timer(void)
>   */
>  void pasta_netns_quit_init(const struct ctx *c)
>  {
> -	struct epoll_event ev = { .events = EPOLLIN };
>  	int flags = O_NONBLOCK | O_CLOEXEC;
>  	struct statfs s = { 0 };
>  	bool try_inotify = true;
> @@ -487,8 +486,8 @@ void pasta_netns_quit_init(const struct ctx *c)
>  		die("netns monitor file number %i too big, exiting", fd);
>  
>  	ref.fd = fd;
> -	ev.data.u64 = ref.u64;
> -	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev);
> +
> +	epoll_add(c->epollfd, EPOLLIN, &ref);
>  }
>  
>  /**
> diff --git a/repair.c b/repair.c
> index f6b1bf36479c..c8f4737fa62a 100644
> --- a/repair.c
> +++ b/repair.c
> @@ -22,6 +22,7 @@
>  #include "inany.h"
>  #include "flow.h"
>  #include "flow_table.h"
> +#include "epoll_ctl.h"
>  
>  #include "repair.h"
>  
> @@ -47,7 +48,6 @@ static int repair_nfds;
>  void repair_sock_init(const struct ctx *c)
>  {
>  	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
> -	struct epoll_event ev = { 0 };
>  
>  	if (c->fd_repair_listen == -1)
>  		return;
> @@ -58,9 +58,7 @@ void repair_sock_init(const struct ctx *c)
>  	}
>  
>  	ref.fd = c->fd_repair_listen;
> -	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
> -	ev.data.u64 = ref.u64;
> -	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
> +	if (epoll_add(c->epollfd, EPOLLIN | EPOLLHUP | EPOLLET, &ref))
>  		err_perror("repair helper socket epoll_ctl(), won't migrate");

Subtle breakage here.  err_perror() expects the error code in errno,
but epoll_add() puts it in the return value (and clobbered errno by
calling err()).

>  }
>  
> @@ -74,7 +72,6 @@ void repair_sock_init(const struct ctx *c)
>  int repair_listen_handler(struct ctx *c, uint32_t events)
>  {
>  	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
> -	struct epoll_event ev = { 0 };
>  	struct ucred ucred;
>  	socklen_t len;
>  	int rc;
> @@ -112,10 +109,9 @@ int repair_listen_handler(struct ctx *c, uint32_t events)
>  		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
>  
>  	ref.fd = c->fd_repair;
> -	ev.events = EPOLLHUP | EPOLLET;
> -	ev.data.u64 = ref.u64;
> -	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
> -		rc = errno;
> +
> +	rc = epoll_add(c->epollfd, EPOLLHUP | EPOLLET, &ref);
> +	if (rc < 0) {
>  		debug_perror("epoll_ctl() on TCP_REPAIR helper socket");

Same here.

>  		close(c->fd_repair);
>  		c->fd_repair = -1;
> diff --git a/tap.c b/tap.c
> index 9812f120d426..5b9403dce25b 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -26,7 +26,6 @@
>  #include <netinet/in.h>
>  #include <arpa/inet.h>
>  #include <stdint.h>
> -#include <sys/epoll.h>
>  #include <sys/ioctl.h>
>  #include <sys/types.h>
>  #include <sys/stat.h>
> @@ -61,6 +60,7 @@
>  #include "log.h"
>  #include "vhost_user.h"
>  #include "vu_common.h"
> +#include "epoll_ctl.h"
>  
>  /* Maximum allowed frame lengths (including L2 header) */
>  
> @@ -1327,14 +1327,12 @@ static void tap_backend_show_hints(struct ctx *c)
>  static void tap_sock_unix_init(const struct ctx *c)
>  {
>  	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
> -	struct epoll_event ev = { 0 };
>  
>  	listen(c->fd_tap_listen, 0);
>  
>  	ref.fd = c->fd_tap_listen;
> -	ev.events = EPOLLIN | EPOLLET;
> -	ev.data.u64 = ref.u64;
> -	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
> +
> +	epoll_add(c->epollfd, EPOLLIN | EPOLLET, &ref);

Preexisting, but we should probably check for errors here.

>  }
>  
>  /**
> @@ -1343,7 +1341,6 @@ static void tap_sock_unix_init(const struct ctx *c)
>   */
>  static void tap_start_connection(const struct ctx *c)
>  {
> -	struct epoll_event ev = { 0 };
>  	union epoll_ref ref = { 0 };
>  
>  	ref.fd = c->fd_tap;
> @@ -1359,9 +1356,7 @@ static void tap_start_connection(const struct ctx *c)
>  		break;
>  	}
>  
> -	ev.events = EPOLLIN | EPOLLRDHUP;
> -	ev.data.u64 = ref.u64;
> -	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> +	epoll_add(c->epollfd, EPOLLIN | EPOLLRDHUP, &ref);

And here.

>  	if (c->ifi4)
>  		arp_send_init_req(c);
> diff --git a/tcp.c b/tcp.c
> index 745353f782f5..db9f17c0622f 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -279,7 +279,6 @@
>  #include <stdbool.h>
>  #include <stddef.h>
>  #include <string.h>
> -#include <sys/epoll.h>
>  #include <sys/ioctl.h>
>  #include <sys/socket.h>
>  #include <sys/timerfd.h>
> @@ -309,6 +308,7 @@
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
>  #include "tcp_vu.h"
> +#include "epoll_ctl.h"
>  
>  /*
>   * The size of TCP header (including options) is given by doff (Data Offset)
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 666ee62b738f..6f21184bdc55 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -44,7 +44,6 @@
>  #include <net/ethernet.h>
>  #include <netinet/in.h>
>  #include <netinet/tcp.h>
> -#include <sys/epoll.h>
>  #include <sys/types.h>
>  #include <sys/socket.h>
>  
> @@ -56,6 +55,7 @@
>  #include "siphash.h"
>  #include "inany.h"
>  #include "flow.h"
> +#include "epoll_ctl.h"
>  
>  #include "flow_table.h"
>  
> diff --git a/udp.c b/udp.c
> index 86585b7e0942..3812d5c2336f 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -94,7 +94,6 @@
>  #include <stdint.h>
>  #include <stddef.h>
>  #include <string.h>
> -#include <sys/epoll.h>
>  #include <sys/types.h>
>  #include <sys/socket.h>
>  #include <sys/uio.h>
> @@ -115,6 +114,7 @@
>  #include "flow_table.h"
>  #include "udp_internal.h"
>  #include "udp_vu.h"
> +#include "epoll_ctl.h"
>  
>  #define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
>  
> diff --git a/udp_flow.c b/udp_flow.c
> index 84973f807167..d9c75f1bb1d8 100644
> --- a/udp_flow.c
> +++ b/udp_flow.c
> @@ -15,6 +15,7 @@
>  #include "passt.h"
>  #include "flow_table.h"
>  #include "udp_internal.h"
> +#include "epoll_ctl.h"
>  
>  #define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
>  
> diff --git a/util.c b/util.c
> index 1067486be414..b2490123590a 100644
> --- a/util.c
> +++ b/util.c
> @@ -18,7 +18,6 @@
>  #include <unistd.h>
>  #include <arpa/inet.h>
>  #include <net/ethernet.h>
> -#include <sys/epoll.h>
>  #include <sys/uio.h>
>  #include <fcntl.h>
>  #include <string.h>
> @@ -35,6 +34,7 @@
>  #include "packet.h"
>  #include "log.h"
>  #include "pcap.h"
> +#include "epoll_ctl.h"
>  #ifdef HAS_GETRANDOM
>  #include <sys/random.h>
>  #endif
> @@ -58,7 +58,6 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  	sa_family_t af = ((const struct sockaddr *)sa)->sa_family;
>  	union epoll_ref ref = { .type = type, .data = data };
>  	bool freebind = false;
> -	struct epoll_event ev;
>  	int fd, y = 1, ret;
>  	uint8_t proto;
>  	int socktype;
> @@ -172,13 +171,9 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  		return ret;
>  	}
>  
> -	ev.events = EPOLLIN;
> -	ev.data.u64 = ref.u64;
> -	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
> -		ret = -errno;
> -		warn("L4 epoll_ctl: %s", strerror_(-ret));
> +	ret = epoll_add(c->epollfd, EPOLLIN, &ref);
> +	if (ret < 0)
>  		return ret;
> -	}
>  
>  	return fd;
>  }
> @@ -994,17 +989,6 @@ void raw_random(void *buf, size_t buflen)
>  		die("Unexpected EOF on random data source");
>  }
>  
> -/**
> - * epoll_del() - Remove a file descriptor from our passt epoll
> - * @epollfd:	epoll file descriptor to remove from
> - * @fd:		File descriptor to remove
> - */
> -void epoll_del(int epollfd, int fd)
> -{
> -	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
> -
> -}
> -
>  /**
>   * encode_domain_name() - Encode domain name according to RFC 1035, section 3.1
>   * @buf:		Buffer to fill in with encoded domain name
> diff --git a/util.h b/util.h
> index c61cbef357aa..8e4b4c5c6032 100644
> --- a/util.h
> +++ b/util.h
> @@ -193,6 +193,9 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  #define SNDBUF_BIG		(4ULL * 1024 * 1024)
>  #define SNDBUF_SMALL		(128ULL * 1024)
>  
> +#define FD_REF_BITS		24
> +#define FD_REF_MAX		((int)MAX_FROM_BITS(FD_REF_BITS))
> +
>  #include <net/if.h>
>  #include <limits.h>
>  #include <stdint.h>
> @@ -300,7 +303,6 @@ static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
>  #define FPRINTF(f, ...)	(void)fprintf(f, __VA_ARGS__)
>  
>  void raw_random(void *buf, size_t buflen);
> -void epoll_del(int epollfd, int fd);
>  
>  /*
>   * Starting from glibc 2.40.9000 and commit 25a5eb4010df ("string: strerror,
> diff --git a/vhost_user.c b/vhost_user.c
> index f8324c59cc6c..aea1e2cbcea5 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -32,8 +32,6 @@
>  #include <inttypes.h>
>  #include <time.h>
>  #include <net/ethernet.h>
> -#include <netinet/in.h>
> -#include <sys/epoll.h>
>  #include <sys/eventfd.h>
>  #include <sys/mman.h>
>  #include <linux/vhost_types.h>
> @@ -45,6 +43,7 @@
>  #include "vhost_user.h"
>  #include "pcap.h"
>  #include "migrate.h"
> +#include "epoll_ctl.h"
>  
>  /* vhost-user version we are compatible with */
>  #define VHOST_USER_VERSION 1
> @@ -753,11 +752,8 @@ static void vu_set_watch(const struct vu_dev *vdev, int idx)
>  		.fd = vdev->vq[idx].kick_fd,
>  		.queue = idx
>  	 };
> -	struct epoll_event ev = { 0 };
>  
> -	ev.data.u64 = ref.u64;
> -	ev.events = EPOLLIN;
> -	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
> +	epoll_add(vdev->context->epollfd, EPOLLIN, &ref);

And here.
>  }
>  
>  /**
> diff --git a/vu_common.c b/vu_common.c
> index b716070ea3c3..b13b7c308fd8 100644
> --- a/vu_common.c
> +++ b/vu_common.c
> @@ -6,7 +6,6 @@
>   */
>  
>  #include <errno.h>
> -#include <unistd.h>
>  #include <sys/uio.h>
>  #include <sys/eventfd.h>
>  #include <netinet/if_ether.h>
> @@ -19,6 +18,7 @@
>  #include "pcap.h"
>  #include "vu_common.h"
>  #include "migrate.h"
> +#include "epoll_ctl.h"
>  
>  #define VU_MAX_TX_BUFFER_NB	2
>  
> -- 
> 2.51.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common
  2025-10-17 10:31 ` [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common Laurent Vivier
  2025-10-17 17:43   ` Stefano Brivio
@ 2025-10-20  1:34   ` David Gibson
  2025-10-21 12:14     ` Laurent Vivier
  1 sibling, 1 reply; 27+ messages in thread
From: David Gibson @ 2025-10-20  1:34 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 16575 bytes --]

On Fri, Oct 17, 2025 at 12:31:26PM +0200, Laurent Vivier wrote:
> The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
> whether a connection was registered with epoll, not which epoll instance.
> This limited flexibility for future multi-epoll support.
> 
> Replace the boolean with a threadnb field in flow_common that identifies
> which thread (and thus which epoll instance) the flow is registered with.
> Use FLOW_THREADNB_INVALID to indicate when a flow is not registered with
> any epoll instance. A threadnb_to_epollfd[] mapping table translates
> thread numbers to their corresponding epoll file descriptors.
> 
> Add helper functions:
> - flow_in_epoll() to check if a flow is registered with epoll
> - flow_epollfd() to retrieve the epoll fd for a flow's thread
> - flow_thread_register() to register an epoll fd with a thread
> - flow_thread_set() to set the thread number of a flow
> 
> This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
> the need to pass the context 'c', since the epoll fd is now directly
> accessible from the flow structure via flow_epollfd().
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Some minor queries only.

> ---
>  flow.c       | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  flow.h       | 12 ++++++++++++
>  passt.c      |  1 +
>  tcp.c        | 39 ++++++++++++++++++++------------------
>  tcp_conn.h   |  8 +-------
>  tcp_splice.c | 24 ++++++++++++------------
>  6 files changed, 99 insertions(+), 38 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index b14e9d8b63ff..d56bae776239 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -116,6 +116,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>  unsigned flow_first_free;
>  union flow flowtab[FLOW_MAX];
>  static const union flow *flow_new_entry; /* = NULL */
> +static int threadnb_to_epollfd[FLOW_THREADNB_SIZE];

Same comments as in the previous version about "thread" in the name.

>  
>  /* Hash table to index it */
>  #define FLOW_HASH_LOAD		70		/* % */
> @@ -347,6 +348,55 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
>  	flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
>  }
>  
> +/**
> + * flow_in_epoll() - Check if flow is registered with an epoll instance
> + * @f:		Flow to check
> + *
> + * Return: true if flow is registered with epoll, false otherwise
> + */
> +bool flow_in_epoll(const struct flow_common *f)
> +{
> +	return f->threadnb != FLOW_THREADNB_INVALID;
> +}
> +
> +/**
> + * flow_epollfd() - Get the epoll file descriptor for a flow
> + * @f:		Flow to query
> + *
> + * Return: epoll file descriptor associated with the flow's thread
> + */
> +int flow_epollfd(const struct flow_common *f)
> +{
> +	ASSERT(f->threadnb < FLOW_THREADNB_MAX);
> +
> +	return threadnb_to_epollfd[f->threadnb];
> +}
> +
> +/**
> + * flow_thread_set() - Associate a flow with a thread
> + * @f:		Flow to update
> + * @threadnb:	Thread number to associate with this flow
> + */
> +void flow_thread_set(struct flow_common *f, int threadnb)
> +{
> +	ASSERT(threadnb < FLOW_THREADNB_MAX);
> +
> +	f->threadnb = threadnb;
> +}
> +
> +/**
> + * flow_thread_register() - Initialize the threadnb -> epollfd mapping
> + * @threadnb:	Thread number to associate to
> + * @epollfd:	epoll file descriptor for the thread
> + */
> +void flow_thread_register(int threadnb, int epollfd)
> +{
> +	ASSERT(threadnb < FLOW_THREADNB_MAX);
> +	ASSERT(epollfd >= 0);

Not sure about the second assert.  It seems like an error people are
unlikely to make by accident, whereas attempting to deliberately clear
/ delete an entry by setting its fd to -1 seems like a reasonable
thing to do.

> +
> +	threadnb_to_epollfd[threadnb] = epollfd;
> +}
> +
>  /**
>   * flow_initiate_() - Move flow to INI, setting pif[INISIDE]
>   * @flow:	Flow to change state
> @@ -548,6 +598,7 @@ union flow *flow_alloc(void)
>  
>  	flow_new_entry = flow;
>  	memset(flow, 0, sizeof(*flow));
> +	flow->f.threadnb = FLOW_THREADNB_INVALID;
>  	flow_set_state(&flow->f, FLOW_STATE_NEW);
>  
>  	return flow;
> @@ -827,7 +878,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  		case FLOW_TCP_SPLICE:
>  			closed = tcp_splice_flow_defer(&flow->tcp_splice);
>  			if (!closed && timer)
> -				tcp_splice_timer(c, &flow->tcp_splice);
> +				tcp_splice_timer(&flow->tcp_splice);
>  			break;
>  		case FLOW_PING4:
>  		case FLOW_PING6:
> diff --git a/flow.h b/flow.h
> index ef138b83add8..700d8b32c990 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -177,6 +177,8 @@ int flowside_connect(const struct ctx *c, int s,
>   * @type:	Type of packet flow
>   * @pif[]:	Interface for each side of the flow
>   * @side[]:	Information for each side of the flow
> + * @threadnb:	Thread number flow is registered with
> + *		(FLOW_THREADNB_INVALID if not)
>   */
>  struct flow_common {
>  #ifdef __GNUC__
> @@ -192,8 +194,14 @@ struct flow_common {
>  #endif
>  	uint8_t		pif[SIDES];
>  	struct flowside	side[SIDES];
> +#define FLOW_THREADNB_BITS 8
> +	unsigned int	threadnb:FLOW_THREADNB_BITS;
>  };
>  
> +#define FLOW_THREADNB_SIZE	(1 << FLOW_THREADNB_BITS)
> +#define FLOW_THREADNB_MAX	(FLOW_THREADNB_SIZE - 1)
> +#define FLOW_THREADNB_INVALID	FLOW_THREADNB_MAX
> +
>  #define FLOW_INDEX_BITS		17	/* 128k - 1 */
>  #define FLOW_MAX		MAX_FROM_BITS(FLOW_INDEX_BITS)
>  
> @@ -249,6 +257,10 @@ flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
>  union flow;
>  
>  void flow_init(void);
> +bool flow_in_epoll(const struct flow_common *f);
> +int flow_epollfd(const struct flow_common *f);
> +void flow_thread_set(struct flow_common *f, int threadnb);
> +void flow_thread_register(int threadnb, int epollfd);
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
>  int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
>  			      int fd);
> diff --git a/passt.c b/passt.c
> index af928111786b..37f2c897be84 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -285,6 +285,7 @@ int main(int argc, char **argv)
>  	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
>  	if (c.epollfd == -1)
>  		die_perror("Failed to create epoll file descriptor");
> +	flow_thread_register(0, c.epollfd);
>  
>  	if (getrlimit(RLIMIT_NOFILE, &limit))
>  		die_perror("Failed to get maximum value of open files limit");
> diff --git a/tcp.c b/tcp.c
> index db9f17c0622f..8c49852b8454 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -504,25 +504,27 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
>   */
>  static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
> -	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
> +	int m = flow_in_epoll(&conn->f) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
>  		                .flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), };
>  	struct epoll_event ev = { .data.u64 = ref.u64 };
> +	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
> +						   : c->epollfd;
>  
>  	if (conn->events == CLOSED) {
> -		if (conn->in_epoll)
> -			epoll_del(c->epollfd, conn->sock);
> +		if (flow_in_epoll(&conn->f))
> +			epoll_del(epollfd, conn->sock);
>  		if (conn->timer != -1)
> -			epoll_del(c->epollfd, conn->timer);
> +			epoll_del(epollfd, conn->timer);
>  		return 0;
>  	}
>  
>  	ev.events = tcp_conn_epoll_events(conn->events, conn->flags);
>  
> -	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
> +	if (epoll_ctl(epollfd, m, conn->sock, &ev))
>  		return -errno;
>  
> -	conn->in_epoll = true;
> +	flow_thread_set(&conn->f, 0);
>  
>  	if (conn->timer != -1) {
>  		union epoll_ref ref_t = { .type = EPOLL_TYPE_TCP_TIMER,
> @@ -531,7 +533,8 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		struct epoll_event ev_t = { .data.u64 = ref_t.u64,
>  					    .events = EPOLLIN | EPOLLET };
>  
> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_MOD, conn->timer, &ev_t))
> +		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_MOD,
> +			      conn->timer, &ev_t))
>  			return -errno;
>  	}
>  
> @@ -540,12 +543,11 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  
>  /**
>   * tcp_timer_ctl() - Set timerfd based on flags/events, create timerfd if needed
> - * @c:		Execution context
>   * @conn:	Connection pointer
>   *
>   * #syscalls timerfd_create timerfd_settime
>   */
> -static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> +static void tcp_timer_ctl(struct tcp_tap_conn *conn)
>  {
>  	struct itimerspec it = { { 0 }, { 0 } };
>  
> @@ -570,7 +572,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		}
>  		conn->timer = fd;
>  
> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
> +		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_ADD,
> +			      conn->timer, &ev)) {

Possibly a question for an earlier patch, but is there a reason we
can't use epoll_add() here?

>  			flow_dbg_perror(conn, "failed to add timer");
>  			close(conn->timer);
>  			conn->timer = -1;
> @@ -628,7 +631,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  			 * flags and factor this into the logic below.
>  			 */
>  			if (flag == ACK_FROM_TAP_DUE)
> -				tcp_timer_ctl(c, conn);
> +				tcp_timer_ctl(conn);
>  
>  			return;
>  		}
> @@ -644,7 +647,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  	if (flag == ACK_FROM_TAP_DUE || flag == ACK_TO_TAP_DUE		  ||
>  	    (flag == ~ACK_FROM_TAP_DUE && (conn->flags & ACK_TO_TAP_DUE)) ||
>  	    (flag == ~ACK_TO_TAP_DUE   && (conn->flags & ACK_FROM_TAP_DUE)))
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  }
>  
>  /**
> @@ -699,7 +702,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp_epoll_ctl(c, conn);
>  
>  	if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  }
>  
>  /**
> @@ -1757,7 +1760,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  				   seq, conn->seq_from_tap);
>  
>  			tcp_send_flag(c, conn, ACK);
> -			tcp_timer_ctl(c, conn);
> +			tcp_timer_ctl(conn);
>  
>  			if (p->count == 1) {
>  				tcp_tap_window_update(c, conn,
> @@ -2406,7 +2409,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>  
>  	if (conn->flags & ACK_TO_TAP_DUE) {
>  		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		if (!(conn->events & ESTABLISHED)) {
>  			flow_dbg(conn, "handshake timeout");
> @@ -2428,7 +2431,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>  				return;
>  
>  			tcp_data_from_sock(c, conn);
> -			tcp_timer_ctl(c, conn);
> +			tcp_timer_ctl(conn);
>  		}
>  	} else {
>  		struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
> @@ -3476,7 +3479,7 @@ int tcp_flow_migrate_source_ext(const struct ctx *c,
>  	if (c->migrate_no_linger)
>  		close(s);
>  	else
> -		epoll_del(c->epollfd, s);
> +		epoll_del(flow_epollfd(&conn->f), s);
>  
>  	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
>  	 * based on the end of the queues.
> @@ -3625,7 +3628,7 @@ static int tcp_flow_repair_connect(const struct ctx *c,
>  		return rc;
>  	}
>  
> -	conn->in_epoll = 0;
> +	conn->f.threadnb = FLOW_THREADNB_INVALID;
>  	conn->timer = -1;
>  	conn->listening_sock = -1;
>  
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 38b5c541f003..81333122d531 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -12,7 +12,6 @@
>  /**
>   * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
>   * @f:			Generic flow information
> - * @in_epoll:		Is the connection in the epoll set?
>   * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
>   * @ws_from_tap:	Window scaling factor advertised from tap/guest
>   * @ws_to_tap:		Window scaling factor advertised to tap/guest
> @@ -36,8 +35,6 @@ struct tcp_tap_conn {
>  	/* Must be first element */
>  	struct flow_common f;
>  
> -	bool		in_epoll	:1;
> -
>  #define TCP_RETRANS_BITS		3
>  	unsigned int	retrans		:TCP_RETRANS_BITS;
>  #define TCP_MAX_RETRANS			MAX_FROM_BITS(TCP_RETRANS_BITS)
> @@ -196,7 +193,6 @@ struct tcp_tap_transfer_ext {
>   * @written:		Bytes written (not fully written from one other side read)
>   * @events:		Events observed/actions performed on connection
>   * @flags:		Connection flags (attributes, not events)
> - * @in_epoll:		Is the connection in the epoll set?
>   */
>  struct tcp_splice_conn {
>  	/* Must be first element */
> @@ -220,8 +216,6 @@ struct tcp_splice_conn {
>  #define RCVLOWAT_SET(sidei_)		((sidei_) ? BIT(1) : BIT(0))
>  #define RCVLOWAT_ACT(sidei_)		((sidei_) ? BIT(3) : BIT(2))
>  #define CLOSING				BIT(4)
> -
> -	bool in_epoll	:1;
>  };
>  
>  /* Socket pools */
> @@ -245,7 +239,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
>  bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
>  
>  bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
> -void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
> +void tcp_splice_timer(struct tcp_splice_conn *conn);
>  int tcp_conn_pool_sock(int pool[]);
>  int tcp_conn_sock(sa_family_t af);
>  int tcp_sock_refill_pool(int pool[], sa_family_t af);
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 6f21184bdc55..703bd7610890 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -149,7 +149,9 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
>  static int tcp_splice_epoll_ctl(const struct ctx *c,
>  				struct tcp_splice_conn *conn)
>  {
> -	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
> +	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
> +						   : c->epollfd;
> +	int m = flow_in_epoll(&conn->f) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	const union epoll_ref ref[SIDES] = {
>  		{ .type = EPOLL_TYPE_TCP_SPLICE, .fd = conn->s[0],
>  		  .flowside = FLOW_SIDX(conn, 0) },
> @@ -161,25 +163,24 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
>  
>  	tcp_splice_conn_epoll_events(conn->events, ev);
>  
> -	if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) ||
> -	    epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) {
> +
> +	if (epoll_ctl(epollfd, m, conn->s[0], &ev[0]) ||
> +	    epoll_ctl(epollfd, m, conn->s[1], &ev[1])) {
>  		int ret = -errno;
>  		flow_perror(conn, "ERROR on epoll_ctl()");
>  		return ret;
>  	}
> -
> -	conn->in_epoll = true;
> +	flow_thread_set(&conn->f, 0);
>  
>  	return 0;
>  }
>  
>  /**
>   * conn_flag_do() - Set/unset given flag, log, update epoll on CLOSING flag
> - * @c:		Execution context
>   * @conn:	Connection pointer
>   * @flag:	Flag to set, or ~flag to unset
>   */
> -static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
> +static void conn_flag_do(struct tcp_splice_conn *conn,
>  			 unsigned long flag)
>  {
>  	if (flag & (flag - 1)) {
> @@ -204,15 +205,15 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
>  	}
>  
>  	if (flag == CLOSING) {
> -		epoll_del(c->epollfd, conn->s[0]);
> -		epoll_del(c->epollfd, conn->s[1]);
> +		epoll_del(flow_epollfd(&conn->f), conn->s[0]);
> +		epoll_del(flow_epollfd(&conn->f), conn->s[1]);
>  	}
>  }
>  
>  #define conn_flag(c, conn, flag)					\
>  	do {								\
>  		flow_trace(conn, "flag at %s:%i", __func__, __LINE__);	\
> -		conn_flag_do(c, conn, flag);				\
> +		conn_flag_do(conn, flag);				\
>  	} while (0)
>  
>  /**
> @@ -751,10 +752,9 @@ void tcp_splice_init(struct ctx *c)
>  
>  /**
>   * tcp_splice_timer() - Timer for spliced connections
> - * @c:		Execution context
>   * @conn:	Connection to handle
>   */
> -void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn)
> +void tcp_splice_timer(struct tcp_splice_conn *conn)
>  {
>  	unsigned sidei;
>  
> -- 
> 2.51.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 5/7] icmp: Use thread-based epoll management for ICMP flows
  2025-10-17 10:31 ` [PATCH v4 5/7] icmp: Use thread-based epoll management for ICMP flows Laurent Vivier
@ 2025-10-20  1:35   ` David Gibson
  0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2025-10-20  1:35 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1728 bytes --]

On Fri, Oct 17, 2025 at 12:31:27PM +0200, Laurent Vivier wrote:
> Store the thread number in the flow_common structure for ICMP ping
> flows using flow_epollfd_set() and retrieve the corresponding epoll
> file descriptor with flow_epollfd_get() instead of passing c->epollfd
> directly. This makes ICMP consistent with the recent TCP changes and
> follows the pattern established in previous commit.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  icmp.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/icmp.c b/icmp.c
> index 56dfac6c958e..baddd8e5aacb 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -149,7 +149,7 @@ unexpected:
>  static void icmp_ping_close(const struct ctx *c,
>  			    const struct icmp_ping_flow *pingf)
>  {
> -	epoll_del(c->epollfd, pingf->sock);
> +	epoll_del(flow_epollfd(&pingf->f), pingf->sock);
>  	close(pingf->sock);
>  	flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
>  }
> @@ -206,11 +206,13 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
>  	if (pingf->sock > FD_REF_MAX)
>  		goto cancel;
>  
> +	flow_thread_set(&pingf->f, 0);
> +
>  	ref.type = EPOLL_TYPE_PING;
>  	ref.flowside = FLOW_SIDX(flow, TGTSIDE);
>  	ref.fd = pingf->sock;
>  
> -	if (epoll_add(c->epollfd, EPOLLIN, &ref) < 0) {
> +	if (epoll_add(flow_epollfd(&pingf->f), EPOLLIN, &ref) < 0) {
>  		close(pingf->sock);
>  		goto cancel;
>  	}
> -- 
> 2.51.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 6/7] udp: Use thread-based epoll management for UDP flows
  2025-10-17 10:31 ` [PATCH v4 6/7] udp: Use thread-based epoll management for UDP flows Laurent Vivier
@ 2025-10-20  1:39   ` David Gibson
  0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2025-10-20  1:39 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2060 bytes --]

On Fri, Oct 17, 2025 at 12:31:28PM +0200, Laurent Vivier wrote:
> Store the thread number in the flow_common structure for UDP
> flows using flow_epollfd_set() and retrieve the corresponding epoll
> file descriptor with flow_epollfd_get() instead of passing c->epollfd
> directly. This makes UDP consistent with the recent TCP and ICMP
> changes.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  udp_flow.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/udp_flow.c b/udp_flow.c
> index caaf3dababb1..ea89a14775a4 100644
> --- a/udp_flow.c
> +++ b/udp_flow.c
> @@ -52,7 +52,7 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
>  	flow_foreach_sidei(sidei) {
>  		flow_hash_remove(c, FLOW_SIDX(uflow, sidei));
>  		if (uflow->s[sidei] >= 0) {
> -			epoll_del(c->epollfd, uflow->s[sidei]);
> +			epoll_del(flow_epollfd(&uflow->f), uflow->s[sidei]);
>  			close(uflow->s[sidei]);
>  			uflow->s[sidei] = -1;

Hm.  I guess it *probably* doesn't matter since the flow's about to go
away.  But just as we clear the socket fields, here we should probably
clear the threadnb field when we remove ourselves from the epoll.

>  		}
> @@ -92,7 +92,9 @@ static int udp_flow_sock(const struct ctx *c,
>  	ref.data = fref.data;
>  	ref.fd = s;
>  
> -	rc = epoll_add(c->epollfd, EPOLLIN, &ref);
> +	flow_thread_set(&uflow->f, 0);
> +
> +	rc = epoll_add(flow_epollfd(&uflow->f), EPOLLIN, &ref);
>  	if (rc < 0) {
>  		close(s);
>  		return rc;
> @@ -101,7 +103,7 @@ static int udp_flow_sock(const struct ctx *c,
>  	if (flowside_connect(c, s, pif, side) < 0) {
>  		rc = -errno;
>  
> -		epoll_del(c->epollfd, s);
> +		epoll_del(flow_epollfd(&uflow->f), s);
>  		close(s);
>  
>  		flow_dbg_perror(uflow, "Couldn't connect flow socket");
> -- 
> 2.51.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 7/7] passt: Move main event loop processing into passt_worker()
  2025-10-17 10:31 ` [PATCH v4 7/7] passt: Move main event loop processing into passt_worker() Laurent Vivier
  2025-10-17 17:43   ` Stefano Brivio
@ 2025-10-20  1:43   ` David Gibson
  2025-10-21  8:00     ` Laurent Vivier
  1 sibling, 1 reply; 27+ messages in thread
From: David Gibson @ 2025-10-20  1:43 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6973 bytes --]

On Fri, Oct 17, 2025 at 12:31:29PM +0200, Laurent Vivier wrote:
> Extract the epoll event processing logic from main() into a separate
> passt_worker() function. This refactoring prepares the code for future
> threading support where passt_worker() will be called as a worker thread
> callback.
> 
> The new function handles:
> - Processing epoll events and dispatching to protocol handlers
> - Event statistics tracking and printing
> - Post-handler periodic tasks (timers, deferred work)
> - Migration handling
> 
> No functional changes, purely a code restructuring.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Looks good as far as it goes, and I've though often in the past that
it would make more sense for the "engine" to go in its own function.

Wondering if it would make more sense to include the epoll_wait()
itself and the loop in this function, rather than leaving that
outside.

> ---
>  passt.c | 160 +++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 88 insertions(+), 72 deletions(-)
> 
> diff --git a/passt.c b/passt.c
> index 37f2c897be84..5bfa4c6353d9 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -229,6 +229,92 @@ static void print_stats(const struct ctx *c, const struct passt_stats *stats,
>  	lines_printed++;
>  }
>  
> +/**
> + * passt_worker() - Process epoll events and handle protocol operations
> + * @opaque:	Pointer to execution context (struct ctx)
> + * @nfds:	Number of file descriptors ready (epoll_wait return value)
> + * @events:	epoll_event array of ready file descriptors
> + */
> +static void passt_worker(void *opaque, int nfds,  struct epoll_event *events)
> +{
> +	static struct passt_stats stats = { 0 };
> +	struct ctx *c = opaque;
> +	struct timespec now;
> +	int i;
> +
> +	if (clock_gettime(CLOCK_MONOTONIC, &now))
> +		err_perror("Failed to get CLOCK_MONOTONIC time");
> +
> +	for (i = 0; i < nfds; i++) {
> +		union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
> +		uint32_t eventmask = events[i].events;
> +
> +		trace("%s: epoll event on %s %i (events: 0x%08x)",
> +		      c->mode == MODE_PASTA ? "pasta" : "passt",
> +		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
> +
> +		switch (ref.type) {
> +		case EPOLL_TYPE_TAP_PASTA:
> +			tap_handler_pasta(c, eventmask, &now);
> +			break;
> +		case EPOLL_TYPE_TAP_PASST:
> +			tap_handler_passt(c, eventmask, &now);
> +			break;
> +		case EPOLL_TYPE_TAP_LISTEN:
> +			tap_listen_handler(c, eventmask);
> +			break;
> +		case EPOLL_TYPE_NSQUIT_INOTIFY:
> +			pasta_netns_quit_inotify_handler(c, ref.fd);
> +			break;
> +		case EPOLL_TYPE_NSQUIT_TIMER:
> +			pasta_netns_quit_timer_handler(c, ref);
> +			break;
> +		case EPOLL_TYPE_TCP:
> +			tcp_sock_handler(c, ref, eventmask);
> +			break;
> +		case EPOLL_TYPE_TCP_SPLICE:
> +			tcp_splice_sock_handler(c, ref, eventmask);
> +			break;
> +		case EPOLL_TYPE_TCP_LISTEN:
> +			tcp_listen_handler(c, ref, &now);
> +			break;
> +		case EPOLL_TYPE_TCP_TIMER:
> +			tcp_timer_handler(c, ref);
> +			break;
> +		case EPOLL_TYPE_UDP_LISTEN:
> +			udp_listen_sock_handler(c, ref, eventmask, &now);
> +			break;
> +		case EPOLL_TYPE_UDP:
> +			udp_sock_handler(c, ref, eventmask, &now);
> +			break;
> +		case EPOLL_TYPE_PING:
> +			icmp_sock_handler(c, ref);
> +			break;
> +		case EPOLL_TYPE_VHOST_CMD:
> +			vu_control_handler(c->vdev, c->fd_tap, eventmask);
> +			break;
> +		case EPOLL_TYPE_VHOST_KICK:
> +			vu_kick_cb(c->vdev, ref, &now);
> +			break;
> +		case EPOLL_TYPE_REPAIR_LISTEN:
> +			repair_listen_handler(c, eventmask);
> +			break;
> +		case EPOLL_TYPE_REPAIR:
> +			repair_handler(c, eventmask);
> +			break;
> +		default:
> +			/* Can't happen */
> +			ASSERT(0);
> +		}
> +		stats.events[ref.type]++;
> +		print_stats(c, &stats, &now);
> +	}
> +
> +	post_handler(c, &now);
> +
> +	migrate_handler(c);
> +}
> +
>  /**
>   * main() - Entry point and main loop
>   * @argc:	Argument count
> @@ -246,8 +332,7 @@ static void print_stats(const struct ctx *c, const struct passt_stats *stats,
>  int main(int argc, char **argv)
>  {
>  	struct epoll_event events[NUM_EPOLL_EVENTS];
> -	struct passt_stats stats = { 0 };
> -	int nfds, i, devnull_fd = -1;
> +	int nfds, devnull_fd = -1;
>  	struct ctx c = { 0 };
>  	struct rlimit limit;
>  	struct timespec now;
> @@ -355,77 +440,8 @@ loop:
>  	if (nfds == -1 && errno != EINTR)
>  		die_perror("epoll_wait() failed in main loop");
>  
> -	if (clock_gettime(CLOCK_MONOTONIC, &now))
> -		err_perror("Failed to get CLOCK_MONOTONIC time");
> -
> -	for (i = 0; i < nfds; i++) {
> -		union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
> -		uint32_t eventmask = events[i].events;
> -
> -		trace("%s: epoll event on %s %i (events: 0x%08x)",
> -		      c.mode == MODE_PASTA ? "pasta" : "passt",
> -		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
> -
> -		switch (ref.type) {
> -		case EPOLL_TYPE_TAP_PASTA:
> -			tap_handler_pasta(&c, eventmask, &now);
> -			break;
> -		case EPOLL_TYPE_TAP_PASST:
> -			tap_handler_passt(&c, eventmask, &now);
> -			break;
> -		case EPOLL_TYPE_TAP_LISTEN:
> -			tap_listen_handler(&c, eventmask);
> -			break;
> -		case EPOLL_TYPE_NSQUIT_INOTIFY:
> -			pasta_netns_quit_inotify_handler(&c, ref.fd);
> -			break;
> -		case EPOLL_TYPE_NSQUIT_TIMER:
> -			pasta_netns_quit_timer_handler(&c, ref);
> -			break;
> -		case EPOLL_TYPE_TCP:
> -			tcp_sock_handler(&c, ref, eventmask);
> -			break;
> -		case EPOLL_TYPE_TCP_SPLICE:
> -			tcp_splice_sock_handler(&c, ref, eventmask);
> -			break;
> -		case EPOLL_TYPE_TCP_LISTEN:
> -			tcp_listen_handler(&c, ref, &now);
> -			break;
> -		case EPOLL_TYPE_TCP_TIMER:
> -			tcp_timer_handler(&c, ref);
> -			break;
> -		case EPOLL_TYPE_UDP_LISTEN:
> -			udp_listen_sock_handler(&c, ref, eventmask, &now);
> -			break;
> -		case EPOLL_TYPE_UDP:
> -			udp_sock_handler(&c, ref, eventmask, &now);
> -			break;
> -		case EPOLL_TYPE_PING:
> -			icmp_sock_handler(&c, ref);
> -			break;
> -		case EPOLL_TYPE_VHOST_CMD:
> -			vu_control_handler(c.vdev, c.fd_tap, eventmask);
> -			break;
> -		case EPOLL_TYPE_VHOST_KICK:
> -			vu_kick_cb(c.vdev, ref, &now);
> -			break;
> -		case EPOLL_TYPE_REPAIR_LISTEN:
> -			repair_listen_handler(&c, eventmask);
> -			break;
> -		case EPOLL_TYPE_REPAIR:
> -			repair_handler(&c, eventmask);
> -			break;
> -		default:
> -			/* Can't happen */
> -			ASSERT(0);
> -		}
> -		stats.events[ref.type]++;
> -		print_stats(&c, &stats, &now);
> -	}
> -
> -	post_handler(&c, &now);
>  
> -	migrate_handler(&c);
> +	passt_worker(&c, nfds, events);
>  
>  	goto loop;
>  }
> -- 
> 2.51.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 7/7] passt: Move main event loop processing into passt_worker()
  2025-10-20  1:43   ` David Gibson
@ 2025-10-21  8:00     ` Laurent Vivier
  2025-10-22  0:53       ` David Gibson
  0 siblings, 1 reply; 27+ messages in thread
From: Laurent Vivier @ 2025-10-21  8:00 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On 20/10/2025 03:43, David Gibson wrote:
> On Fri, Oct 17, 2025 at 12:31:29PM +0200, Laurent Vivier wrote:
>> Extract the epoll event processing logic from main() into a separate
>> passt_worker() function. This refactoring prepares the code for future
>> threading support where passt_worker() will be called as a worker thread
>> callback.
>>
>> The new function handles:
>> - Processing epoll events and dispatching to protocol handlers
>> - Event statistics tracking and printing
>> - Post-handler periodic tasks (timers, deferred work)
>> - Migration handling
>>
>> No functional changes, purely a code restructuring.
>>
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> 
> Looks good as far as it goes, and I've though often in the past that
> it would make more sense for the "engine" to go in its own function.
> 
> Wondering if it would make more sense to include the epoll_wait()
> itself and the loop in this function, rather than leaving that
> outside.
> 

When I introduce the multithreading and the multiqueue, as the thread is driven by the 
epollfd, the events are managed by the multiqueue part and the epollfd by the multithread 
part.

The "threading" worker is:

static void *threading_worker(void *opaque)
{
         struct threading_context *tc = opaque;

         while (true) {
                 struct epoll_event events[NUM_EPOLL_EVENTS];
                 int nfds;

                 nfds = epoll_wait(tc->epollfd, events, NUM_EPOLL_EVENTS,
                                   TIMER_INTERVAL);
                 if (nfds == -1 && errno != EINTR)
                         die_perror("epoll_wait() failed");

                 tc->worker(tc->opaque, nfds, events);
         }

         return NULL;
}

And the passt worker is registered with:

         threading_worker_set(0, passt_worker, NULL, &c);

where:

int threading_worker_set(unsigned int threadnb,
                          void (*worker)(void *, int, struct epoll_event *),
                          bool (*is_valid)(void *, unsigned int criteria),
                          void *opaque)
{
         struct threading_context *tc;

         if (threadnb >= ARRAY_SIZE(threads))
                 return -1;

         tc = &threads[threadnb];

         tc->worker = worker;
         tc->is_valid = is_valid;
         tc->opaque = opaque;

         return 0;
}

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/7] epoll_ctl: Extract epoll operations
  2025-10-20  1:20   ` David Gibson
@ 2025-10-21 11:52     ` Laurent Vivier
  2025-10-22  0:58       ` David Gibson
  0 siblings, 1 reply; 27+ messages in thread
From: Laurent Vivier @ 2025-10-21 11:52 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On 20/10/2025 03:20, David Gibson wrote:
> On Fri, Oct 17, 2025 at 12:31:24PM +0200, Laurent Vivier wrote:
>> Centralize epoll_add() and epoll_del() helper functions into new
>> epoll_ctl.c/h files.
>>
>> This also moves the union epoll_ref definition from passt.h to
>> epoll_ctl.h where it's more logically placed.
>>
>> The new epoll_add() helper simplifies adding file descriptors to epoll
>> by taking an epoll_ref and events, handling error reporting
>> consistently across all call sites.
>>
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> 
> Concept looks good, some minor details in execution.
> 
>> ---
>>   Makefile     | 22 +++++++++++-----------
>>   epoll_ctl.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   epoll_ctl.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   icmp.c       |  4 +---
>>   passt.c      |  2 +-
>>   passt.h      | 34 ----------------------------------
>>   pasta.c      |  7 +++----
>>   repair.c     | 14 +++++---------
>>   tap.c        | 13 ++++---------
>>   tcp.c        |  2 +-
>>   tcp_splice.c |  2 +-
>>   udp.c        |  2 +-
>>   udp_flow.c   |  1 +
>>   util.c       | 22 +++-------------------
>>   util.h       |  4 +++-
>>   vhost_user.c |  8 ++------
>>   vu_common.c  |  2 +-
>>   17 files changed, 134 insertions(+), 101 deletions(-)
>>   create mode 100644 epoll_ctl.c
>>   create mode 100644 epoll_ctl.h
...
>> @@ -1327,14 +1327,12 @@ static void tap_backend_show_hints(struct ctx *c)
>>   static void tap_sock_unix_init(const struct ctx *c)
>>   {
>>   	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
>> -	struct epoll_event ev = { 0 };
>>   
>>   	listen(c->fd_tap_listen, 0);
>>   
>>   	ref.fd = c->fd_tap_listen;
>> -	ev.events = EPOLLIN | EPOLLET;
>> -	ev.data.u64 = ref.u64;
>> -	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
>> +
>> +	epoll_add(c->epollfd, EPOLLIN | EPOLLET, &ref);
> 
> Preexisting, but we should probably check for errors here.
> 

To do what? die() ? err() to report the place where the error happens?

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common
  2025-10-20  1:34   ` David Gibson
@ 2025-10-21 12:14     ` Laurent Vivier
  2025-10-22  1:00       ` David Gibson
  0 siblings, 1 reply; 27+ messages in thread
From: Laurent Vivier @ 2025-10-21 12:14 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On 20/10/2025 03:34, David Gibson wrote:
>> @@ -570,7 +572,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>>   		}
>>   		conn->timer = fd;
>>   
>> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
>> +		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_ADD,
>> +			      conn->timer, &ev)) {
> Possibly a question for an earlier patch, but is there a reason we
> can't use epoll_add() here?
> 

Yes, the fd we use in epoll_ctl() and the fd stored in ref are not the same.

The fd in ref is conn->sock (epoll_add() takes fd from ref) but in epoll_ctl() we add 
conn->timer (the fd from timerfd_create()).

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common
  2025-10-17 17:43   ` Stefano Brivio
@ 2025-10-21 13:13     ` Laurent Vivier
  0 siblings, 0 replies; 27+ messages in thread
From: Laurent Vivier @ 2025-10-21 13:13 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, David Gibson

On 17/10/2025 19:43, Stefano Brivio wrote:
> Nits only:
> 
> On Fri, 17 Oct 2025 12:31:26 +0200
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
>> The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
>> whether a connection was registered with epoll, not which epoll instance.
>> This limited flexibility for future multi-epoll support.
>>
>> Replace the boolean with a threadnb field in flow_common that identifies
>> which thread (and thus which epoll instance) the flow is registered with.
>> Use FLOW_THREADNB_INVALID to indicate when a flow is not registered with
>> any epoll instance. A threadnb_to_epollfd[] mapping table translates
>> thread numbers to their corresponding epoll file descriptors.
> 
> It wasn't really clear to me until I actually started digging into this
> change what NB meant there. For me it's "notifier block" before
> "number", but that didn't make sense either.
> 
> Some ideas (even though maybe as David suggested this name
> shouldn't have "thread" in it at all):
> 
> - f->thread / FLOW_THREAD_INVALID
> 
> - f->thread_id / FLOW_THREAD_ID_INVALID
> 
> ...or f->epoll_id, reflecting David's observation?

epoll_id seems reasonable.

> 
>> Add helper functions:
>> - flow_in_epoll() to check if a flow is registered with epoll
>> - flow_epollfd() to retrieve the epoll fd for a flow's thread
>> - flow_thread_register() to register an epoll fd with a thread
>> - flow_thread_set() to set the thread number of a flow
>>
>> This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
>> the need to pass the context 'c', since the epoll fd is now directly
>> accessible from the flow structure via flow_epollfd().
>>
>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>> ---
>>   flow.c       | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>>   flow.h       | 12 ++++++++++++
>>   passt.c      |  1 +
>>   tcp.c        | 39 ++++++++++++++++++++------------------
>>   tcp_conn.h   |  8 +-------
>>   tcp_splice.c | 24 ++++++++++++------------
>>   6 files changed, 99 insertions(+), 38 deletions(-)
>>
>> diff --git a/flow.c b/flow.c
>> index b14e9d8b63ff..d56bae776239 100644
>> --- a/flow.c
>> +++ b/flow.c
>> @@ -116,6 +116,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>>   unsigned flow_first_free;
>>   union flow flowtab[FLOW_MAX];
>>   static const union flow *flow_new_entry; /* = NULL */
> 
> It would be nice to have an idea of how this table is organised without
> looking at how it's used, say:
> 
> /* Table of epoll file descriptors, indexed by thread number */
> 
> ...or indexed by "epoll identifiers", perhaps, if we want to drop
> references to threads here.

I will drop reference to threads.

> 
>> +static int threadnb_to_epollfd[FLOW_THREADNB_SIZE];
>>   
>>   /* Hash table to index it */
>>   #define FLOW_HASH_LOAD		70		/* % */
>> @@ -347,6 +348,55 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
>>   	flow_log_details_(f, LOG_DEBUG, MAX(state, oldstate));
>>   }
>>   
>> +/**
>> + * flow_in_epoll() - Check if flow is registered with an epoll instance
>> + * @f:		Flow to check
>> + *
>> + * Return: true if flow is registered with epoll, false otherwise
>> + */
>> +bool flow_in_epoll(const struct flow_common *f)
>> +{
>> +	return f->threadnb != FLOW_THREADNB_INVALID;
>> +}
>> +
>> +/**
>> + * flow_epollfd() - Get the epoll file descriptor for a flow
>> + * @f:		Flow to query
>> + *
>> + * Return: epoll file descriptor associated with the flow's thread
>> + */
>> +int flow_epollfd(const struct flow_common *f)
>> +{
>> +	ASSERT(f->threadnb < FLOW_THREADNB_MAX);
>> +
>> +	return threadnb_to_epollfd[f->threadnb];
>> +}
>> +
>> +/**
>> + * flow_thread_set() - Associate a flow with a thread
>> + * @f:		Flow to update
>> + * @threadnb:	Thread number to associate with this flow
>> + */
>> +void flow_thread_set(struct flow_common *f, int threadnb)
>> +{
>> +	ASSERT(threadnb < FLOW_THREADNB_MAX);
>> +
>> +	f->threadnb = threadnb;
>> +}
>> +
>> +/**
>> + * flow_thread_register() - Initialize the threadnb -> epollfd mapping
>> + * @threadnb:	Thread number to associate to
>> + * @epollfd:	epoll file descriptor for the thread
>> + */
>> +void flow_thread_register(int threadnb, int epollfd)
>> +{
>> +	ASSERT(threadnb < FLOW_THREADNB_MAX);
>> +	ASSERT(epollfd >= 0);
>> +
>> +	threadnb_to_epollfd[threadnb] = epollfd;
>> +}
>> +
>>   /**
>>    * flow_initiate_() - Move flow to INI, setting pif[INISIDE]
>>    * @flow:	Flow to change state
>> @@ -548,6 +598,7 @@ union flow *flow_alloc(void)
>>   
>>   	flow_new_entry = flow;
>>   	memset(flow, 0, sizeof(*flow));
>> +	flow->f.threadnb = FLOW_THREADNB_INVALID;
>>   	flow_set_state(&flow->f, FLOW_STATE_NEW);
>>   
>>   	return flow;
>> @@ -827,7 +878,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>>   		case FLOW_TCP_SPLICE:
>>   			closed = tcp_splice_flow_defer(&flow->tcp_splice);
>>   			if (!closed && timer)
>> -				tcp_splice_timer(c, &flow->tcp_splice);
>> +				tcp_splice_timer(&flow->tcp_splice);
>>   			break;
>>   		case FLOW_PING4:
>>   		case FLOW_PING6:
>> diff --git a/flow.h b/flow.h
>> index ef138b83add8..700d8b32c990 100644
>> --- a/flow.h
>> +++ b/flow.h
>> @@ -177,6 +177,8 @@ int flowside_connect(const struct ctx *c, int s,
>>    * @type:	Type of packet flow
>>    * @pif[]:	Interface for each side of the flow
>>    * @side[]:	Information for each side of the flow
>> + * @threadnb:	Thread number flow is registered with
>> + *		(FLOW_THREADNB_INVALID if not)
> 
> You could phrase this more directly, say, "thread identifier, or
> FLOW_THREAD_INVALID", or "epollfd identifier, or EPOLLFD_ID_INVALID".
> 
> This is a structure describing flows, it's not surprising it's about
> the flow.
> 
>>    */
>>   struct flow_common {
>>   #ifdef __GNUC__
>> @@ -192,8 +194,14 @@ struct flow_common {
>>   #endif
>>   	uint8_t		pif[SIDES];
>>   	struct flowside	side[SIDES];
>> +#define FLOW_THREADNB_BITS 8
>> +	unsigned int	threadnb:FLOW_THREADNB_BITS;
>>   };
>>   
>> +#define FLOW_THREADNB_SIZE	(1 << FLOW_THREADNB_BITS)
>> +#define FLOW_THREADNB_MAX	(FLOW_THREADNB_SIZE - 1)
>> +#define FLOW_THREADNB_INVALID	FLOW_THREADNB_MAX
>> +
>>   #define FLOW_INDEX_BITS		17	/* 128k - 1 */
>>   #define FLOW_MAX		MAX_FROM_BITS(FLOW_INDEX_BITS)
>>   
>> @@ -249,6 +257,10 @@ flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
>>   union flow;
>>   
>>   void flow_init(void);
>> +bool flow_in_epoll(const struct flow_common *f);
>> +int flow_epollfd(const struct flow_common *f);
>> +void flow_thread_set(struct flow_common *f, int threadnb);
>> +void flow_thread_register(int threadnb, int epollfd);
>>   void flow_defer_handler(const struct ctx *c, const struct timespec *now);
>>   int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
>>   			      int fd);
>> diff --git a/passt.c b/passt.c
>> index af928111786b..37f2c897be84 100644
>> --- a/passt.c
>> +++ b/passt.c
>> @@ -285,6 +285,7 @@ int main(int argc, char **argv)
>>   	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
>>   	if (c.epollfd == -1)
>>   		die_perror("Failed to create epoll file descriptor");
>> +	flow_thread_register(0, c.epollfd);
> 
> Temporary, I suppose. If not, maybe it deserves its own constant, such
> as EPOLLFD_ID_DEFAULT?

I agree

> 
>>   	if (getrlimit(RLIMIT_NOFILE, &limit))
>>   		die_perror("Failed to get maximum value of open files limit");
>> diff --git a/tcp.c b/tcp.c
>> index db9f17c0622f..8c49852b8454 100644
>> --- a/tcp.c
>> +++ b/tcp.c
>> @@ -504,25 +504,27 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
>>    */
>>   static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>>   {
>> -	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>> +	int m = flow_in_epoll(&conn->f) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>>   	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
>>   		                .flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), };
>>   	struct epoll_event ev = { .data.u64 = ref.u64 };
>> +	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
>> +						   : c->epollfd;
> 
> We usually align the second expression to the ternary operator, see
> also example in sock_l4_sa(), say:
> 
> 	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
> 					      : c->epollfd;
> 

I do like this usually. Missed this one when I change the name of the function.

>>   
>>   	if (conn->events == CLOSED) {
>> -		if (conn->in_epoll)
>> -			epoll_del(c->epollfd, conn->sock);
>> +		if (flow_in_epoll(&conn->f))
>> +			epoll_del(epollfd, conn->sock);
>>   		if (conn->timer != -1)
>> -			epoll_del(c->epollfd, conn->timer);
>> +			epoll_del(epollfd, conn->timer);
>>   		return 0;
>>   	}
>>   
>>   	ev.events = tcp_conn_epoll_events(conn->events, conn->flags);
>>   
>> -	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
>> +	if (epoll_ctl(epollfd, m, conn->sock, &ev))
>>   		return -errno;
>>   
>> -	conn->in_epoll = true;
>> +	flow_thread_set(&conn->f, 0);
> 
> Not temporary I guess, maybe it's an EPOLLFD_ID_DEFAULT?

I agree

> 
>>   
>>   	if (conn->timer != -1) {
>>   		union epoll_ref ref_t = { .type = EPOLL_TYPE_TCP_TIMER,
>> @@ -531,7 +533,8 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>>   		struct epoll_event ev_t = { .data.u64 = ref_t.u64,
>>   					    .events = EPOLLIN | EPOLLET };
>>   
>> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_MOD, conn->timer, &ev_t))
>> +		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_MOD,
>> +			      conn->timer, &ev_t))
>>   			return -errno;
>>   	}
>>   
>> @@ -540,12 +543,11 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>>   
>>   /**
>>    * tcp_timer_ctl() - Set timerfd based on flags/events, create timerfd if needed
>> - * @c:		Execution context
>>    * @conn:	Connection pointer
>>    *
>>    * #syscalls timerfd_create timerfd_settime
>>    */
>> -static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>> +static void tcp_timer_ctl(struct tcp_tap_conn *conn)
>>   {
>>   	struct itimerspec it = { { 0 }, { 0 } };
>>   
>> @@ -570,7 +572,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>>   		}
>>   		conn->timer = fd;
>>   
>> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
>> +		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_ADD,
> 
> I wouldn't find it outrageous if you assigned an epollfd local variable
> first so that this doesn't need two lines.

I will do...

> 
>> +			      conn->timer, &ev)) {
>>   			flow_dbg_perror(conn, "failed to add timer");
>>   			close(conn->timer);
>>   			conn->timer = -1;
>> @@ -628,7 +631,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
>>   			 * flags and factor this into the logic below.
>>   			 */
>>   			if (flag == ACK_FROM_TAP_DUE)
>> -				tcp_timer_ctl(c, conn);
>> +				tcp_timer_ctl(conn);
>>   
>>   			return;
>>   		}
>> @@ -644,7 +647,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
>>   	if (flag == ACK_FROM_TAP_DUE || flag == ACK_TO_TAP_DUE		  ||
>>   	    (flag == ~ACK_FROM_TAP_DUE && (conn->flags & ACK_TO_TAP_DUE)) ||
>>   	    (flag == ~ACK_TO_TAP_DUE   && (conn->flags & ACK_FROM_TAP_DUE)))
>> -		tcp_timer_ctl(c, conn);
>> +		tcp_timer_ctl(conn);
>>   }
>>   
>>   /**
>> @@ -699,7 +702,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
>>   		tcp_epoll_ctl(c, conn);
>>   
>>   	if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
>> -		tcp_timer_ctl(c, conn);
>> +		tcp_timer_ctl(conn);
>>   }
>>   
>>   /**
>> @@ -1757,7 +1760,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>>   				   seq, conn->seq_from_tap);
>>   
>>   			tcp_send_flag(c, conn, ACK);
>> -			tcp_timer_ctl(c, conn);
>> +			tcp_timer_ctl(conn);
>>   
>>   			if (p->count == 1) {
>>   				tcp_tap_window_update(c, conn,
>> @@ -2406,7 +2409,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>>   
>>   	if (conn->flags & ACK_TO_TAP_DUE) {
>>   		tcp_send_flag(c, conn, ACK_IF_NEEDED);
>> -		tcp_timer_ctl(c, conn);
>> +		tcp_timer_ctl(conn);
>>   	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>>   		if (!(conn->events & ESTABLISHED)) {
>>   			flow_dbg(conn, "handshake timeout");
>> @@ -2428,7 +2431,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>>   				return;
>>   
>>   			tcp_data_from_sock(c, conn);
>> -			tcp_timer_ctl(c, conn);
>> +			tcp_timer_ctl(conn);
>>   		}
>>   	} else {
>>   		struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
>> @@ -3476,7 +3479,7 @@ int tcp_flow_migrate_source_ext(const struct ctx *c,
>>   	if (c->migrate_no_linger)
>>   		close(s);
>>   	else
>> -		epoll_del(c->epollfd, s);
>> +		epoll_del(flow_epollfd(&conn->f), s);
>>   
>>   	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
>>   	 * based on the end of the queues.
>> @@ -3625,7 +3628,7 @@ static int tcp_flow_repair_connect(const struct ctx *c,
>>   		return rc;
>>   	}
>>   
>> -	conn->in_epoll = 0;
>> +	conn->f.threadnb = FLOW_THREADNB_INVALID;
>>   	conn->timer = -1;
>>   	conn->listening_sock = -1;
>>   
>> diff --git a/tcp_conn.h b/tcp_conn.h
>> index 38b5c541f003..81333122d531 100644
>> --- a/tcp_conn.h
>> +++ b/tcp_conn.h
>> @@ -12,7 +12,6 @@
>>   /**
>>    * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
>>    * @f:			Generic flow information
>> - * @in_epoll:		Is the connection in the epoll set?
>>    * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
>>    * @ws_from_tap:	Window scaling factor advertised from tap/guest
>>    * @ws_to_tap:		Window scaling factor advertised to tap/guest
>> @@ -36,8 +35,6 @@ struct tcp_tap_conn {
>>   	/* Must be first element */
>>   	struct flow_common f;
>>   
>> -	bool		in_epoll	:1;
>> -
>>   #define TCP_RETRANS_BITS		3
>>   	unsigned int	retrans		:TCP_RETRANS_BITS;
>>   #define TCP_MAX_RETRANS			MAX_FROM_BITS(TCP_RETRANS_BITS)
>> @@ -196,7 +193,6 @@ struct tcp_tap_transfer_ext {
>>    * @written:		Bytes written (not fully written from one other side read)
>>    * @events:		Events observed/actions performed on connection
>>    * @flags:		Connection flags (attributes, not events)
>> - * @in_epoll:		Is the connection in the epoll set?
>>    */
>>   struct tcp_splice_conn {
>>   	/* Must be first element */
>> @@ -220,8 +216,6 @@ struct tcp_splice_conn {
>>   #define RCVLOWAT_SET(sidei_)		((sidei_) ? BIT(1) : BIT(0))
>>   #define RCVLOWAT_ACT(sidei_)		((sidei_) ? BIT(3) : BIT(2))
>>   #define CLOSING				BIT(4)
>> -
>> -	bool in_epoll	:1;
>>   };
>>   
>>   /* Socket pools */
>> @@ -245,7 +239,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
>>   bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
>>   
>>   bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
>> -void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
>> +void tcp_splice_timer(struct tcp_splice_conn *conn);
>>   int tcp_conn_pool_sock(int pool[]);
>>   int tcp_conn_sock(sa_family_t af);
>>   int tcp_sock_refill_pool(int pool[], sa_family_t af);
>> diff --git a/tcp_splice.c b/tcp_splice.c
>> index 6f21184bdc55..703bd7610890 100644
>> --- a/tcp_splice.c
>> +++ b/tcp_splice.c
>> @@ -149,7 +149,9 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
>>   static int tcp_splice_epoll_ctl(const struct ctx *c,
>>   				struct tcp_splice_conn *conn)
>>   {
>> -	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>> +	int epollfd = flow_in_epoll(&conn->f) ? flow_epollfd(&conn->f)
>> +						   : c->epollfd;
> 
> Same as above.
> 
> Given the new, more limited usage of c->epollfd, if it's not a temporary
> thing (I couldn't quite guess what your plan is), maybe worth updating
> its documentation in struct ctx?

In a later patch I remove epollfd from ctx and rely on the threadnb (0 for main) to get 
epollfd.

Thansk,
Laurent


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 7/7] passt: Move main event loop processing into passt_worker()
  2025-10-21  8:00     ` Laurent Vivier
@ 2025-10-22  0:53       ` David Gibson
  2025-10-22  6:49         ` Laurent Vivier
  0 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2025-10-22  0:53 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2248 bytes --]

On Tue, Oct 21, 2025 at 10:00:58AM +0200, Laurent Vivier wrote:
> On 20/10/2025 03:43, David Gibson wrote:
> > On Fri, Oct 17, 2025 at 12:31:29PM +0200, Laurent Vivier wrote:
> > > Extract the epoll event processing logic from main() into a separate
> > > passt_worker() function. This refactoring prepares the code for future
> > > threading support where passt_worker() will be called as a worker thread
> > > callback.
> > > 
> > > The new function handles:
> > > - Processing epoll events and dispatching to protocol handlers
> > > - Event statistics tracking and printing
> > > - Post-handler periodic tasks (timers, deferred work)
> > > - Migration handling
> > > 
> > > No functional changes, purely a code restructuring.
> > > 
> > > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > 
> > Looks good as far as it goes, and I've though often in the past that
> > it would make more sense for the "engine" to go in its own function.
> > 
> > Wondering if it would make more sense to include the epoll_wait()
> > itself and the loop in this function, rather than leaving that
> > outside.
> > 
> 
> When I introduce the multithreading and the multiqueue, as the thread is
> driven by the epollfd, the events are managed by the multiqueue part and the
> epollfd by the multithread part.
> 
> The "threading" worker is:
> 
> static void *threading_worker(void *opaque)
> {
>         struct threading_context *tc = opaque;
> 
>         while (true) {
>                 struct epoll_event events[NUM_EPOLL_EVENTS];
>                 int nfds;
> 
>                 nfds = epoll_wait(tc->epollfd, events, NUM_EPOLL_EVENTS,
>                                   TIMER_INTERVAL);
>                 if (nfds == -1 && errno != EINTR)
>                         die_perror("epoll_wait() failed");
> 
>                 tc->worker(tc->opaque, nfds, events);

IIUC the point here is that eventually the epoll_wait() will be
common, but the worker might be different for different threads.  Is
that correct?

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 2/7] epoll_ctl: Extract epoll operations
  2025-10-21 11:52     ` Laurent Vivier
@ 2025-10-22  0:58       ` David Gibson
  0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2025-10-22  0:58 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2992 bytes --]

On Tue, Oct 21, 2025 at 01:52:25PM +0200, Laurent Vivier wrote:
> On 20/10/2025 03:20, David Gibson wrote:
> > On Fri, Oct 17, 2025 at 12:31:24PM +0200, Laurent Vivier wrote:
> > > Centralize epoll_add() and epoll_del() helper functions into new
> > > epoll_ctl.c/h files.
> > > 
> > > This also moves the union epoll_ref definition from passt.h to
> > > epoll_ctl.h where it's more logically placed.
> > > 
> > > The new epoll_add() helper simplifies adding file descriptors to epoll
> > > by taking an epoll_ref and events, handling error reporting
> > > consistently across all call sites.
> > > 
> > > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > 
> > Concept looks good, some minor details in execution.
> > 
> > > ---
> > >   Makefile     | 22 +++++++++++-----------
> > >   epoll_ctl.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > >   epoll_ctl.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >   icmp.c       |  4 +---
> > >   passt.c      |  2 +-
> > >   passt.h      | 34 ----------------------------------
> > >   pasta.c      |  7 +++----
> > >   repair.c     | 14 +++++---------
> > >   tap.c        | 13 ++++---------
> > >   tcp.c        |  2 +-
> > >   tcp_splice.c |  2 +-
> > >   udp.c        |  2 +-
> > >   udp_flow.c   |  1 +
> > >   util.c       | 22 +++-------------------
> > >   util.h       |  4 +++-
> > >   vhost_user.c |  8 ++------
> > >   vu_common.c  |  2 +-
> > >   17 files changed, 134 insertions(+), 101 deletions(-)
> > >   create mode 100644 epoll_ctl.c
> > >   create mode 100644 epoll_ctl.h
> ...
> > > @@ -1327,14 +1327,12 @@ static void tap_backend_show_hints(struct ctx *c)
> > >   static void tap_sock_unix_init(const struct ctx *c)
> > >   {
> > >   	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
> > > -	struct epoll_event ev = { 0 };
> > >   	listen(c->fd_tap_listen, 0);
> > >   	ref.fd = c->fd_tap_listen;
> > > -	ev.events = EPOLLIN | EPOLLET;
> > > -	ev.data.u64 = ref.u64;
> > > -	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
> > > +
> > > +	epoll_add(c->epollfd, EPOLLIN | EPOLLET, &ref);
> > 
> > Preexisting, but we should probably check for errors here.
> > 
> 
> To do what? die() ? err() to report the place where the error happens?

The case I'm concerned about is this fails, but we carry on.  Then we
lose 1/Nth of our packets because we're not getting events for one
queue, and weird failures follow.  It takes us ages to debug, because
we don't consider that the fd might have silently failed to make it
into the epoll set.

err() would be probably be enough to make this debuggable.  But
honestly, if we can't put our tap fds into the epoll, we're in a
sufficiently bad state that die() makes sense.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common
  2025-10-21 12:14     ` Laurent Vivier
@ 2025-10-22  1:00       ` David Gibson
  0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2025-10-22  1:00 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1159 bytes --]

On Tue, Oct 21, 2025 at 02:14:02PM +0200, Laurent Vivier wrote:
> On 20/10/2025 03:34, David Gibson wrote:
> > > @@ -570,7 +572,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > >   		}
> > >   		conn->timer = fd;
> > > -		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
> > > +		if (epoll_ctl(flow_epollfd(&conn->f), EPOLL_CTL_ADD,
> > > +			      conn->timer, &ev)) {
> > Possibly a question for an earlier patch, but is there a reason we
> > can't use epoll_add() here?
> > 
> 
> Yes, the fd we use in epoll_ctl() and the fd stored in ref are not the same.
> 
> The fd in ref is conn->sock (epoll_add() takes fd from ref) but in
> epoll_ctl() we add conn->timer (the fd from timerfd_create()).

Huh.  That seems like a pre-existing bug.  I'm pretty sure the epoll
ref for the timer ought to have the timer's fd in it.  I suspect we
just never previously used the ref.fd field for this case.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 7/7] passt: Move main event loop processing into passt_worker()
  2025-10-22  0:53       ` David Gibson
@ 2025-10-22  6:49         ` Laurent Vivier
  2025-10-23  1:24           ` David Gibson
  0 siblings, 1 reply; 27+ messages in thread
From: Laurent Vivier @ 2025-10-22  6:49 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On 22/10/2025 02:53, David Gibson wrote:
> On Tue, Oct 21, 2025 at 10:00:58AM +0200, Laurent Vivier wrote:
>> On 20/10/2025 03:43, David Gibson wrote:
>>> On Fri, Oct 17, 2025 at 12:31:29PM +0200, Laurent Vivier wrote:
>>>> Extract the epoll event processing logic from main() into a separate
>>>> passt_worker() function. This refactoring prepares the code for future
>>>> threading support where passt_worker() will be called as a worker thread
>>>> callback.
>>>>
>>>> The new function handles:
>>>> - Processing epoll events and dispatching to protocol handlers
>>>> - Event statistics tracking and printing
>>>> - Post-handler periodic tasks (timers, deferred work)
>>>> - Migration handling
>>>>
>>>> No functional changes, purely a code restructuring.
>>>>
>>>> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
>>>
>>> Looks good as far as it goes, and I've though often in the past that
>>> it would make more sense for the "engine" to go in its own function.
>>>
>>> Wondering if it would make more sense to include the epoll_wait()
>>> itself and the loop in this function, rather than leaving that
>>> outside.
>>>
>>
>> When I introduce the multithreading and the multiqueue, as the thread is
>> driven by the epollfd, the events are managed by the multiqueue part and the
>> epollfd by the multithread part.
>>
>> The "threading" worker is:
>>
>> static void *threading_worker(void *opaque)
>> {
>>          struct threading_context *tc = opaque;
>>
>>          while (true) {
>>                  struct epoll_event events[NUM_EPOLL_EVENTS];
>>                  int nfds;
>>
>>                  nfds = epoll_wait(tc->epollfd, events, NUM_EPOLL_EVENTS,
>>                                    TIMER_INTERVAL);
>>                  if (nfds == -1 && errno != EINTR)
>>                          die_perror("epoll_wait() failed");
>>
>>                  tc->worker(tc->opaque, nfds, events);
> 
> IIUC the point here is that eventually the epoll_wait() will be
> common, but the worker might be different for different threads.  Is
> that correct?
> 

Yes, we can have the passt_worker() for all threads, but we can also have a specific 
worker for passt main (with netlink, listen, ...), a specific for TX vhost-user (with 
kickfd), and another one for RX (with all the ICMP, UDP, TCP sockets).

Thanks,
Laurent


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 7/7] passt: Move main event loop processing into passt_worker()
  2025-10-22  6:49         ` Laurent Vivier
@ 2025-10-23  1:24           ` David Gibson
  0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2025-10-23  1:24 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3228 bytes --]

On Wed, Oct 22, 2025 at 08:49:38AM +0200, Laurent Vivier wrote:
> On 22/10/2025 02:53, David Gibson wrote:
> > On Tue, Oct 21, 2025 at 10:00:58AM +0200, Laurent Vivier wrote:
> > > On 20/10/2025 03:43, David Gibson wrote:
> > > > On Fri, Oct 17, 2025 at 12:31:29PM +0200, Laurent Vivier wrote:
> > > > > Extract the epoll event processing logic from main() into a separate
> > > > > passt_worker() function. This refactoring prepares the code for future
> > > > > threading support where passt_worker() will be called as a worker thread
> > > > > callback.
> > > > > 
> > > > > The new function handles:
> > > > > - Processing epoll events and dispatching to protocol handlers
> > > > > - Event statistics tracking and printing
> > > > > - Post-handler periodic tasks (timers, deferred work)
> > > > > - Migration handling
> > > > > 
> > > > > No functional changes, purely a code restructuring.
> > > > > 
> > > > > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > > > 
> > > > Looks good as far as it goes, and I've though often in the past that
> > > > it would make more sense for the "engine" to go in its own function.
> > > > 
> > > > Wondering if it would make more sense to include the epoll_wait()
> > > > itself and the loop in this function, rather than leaving that
> > > > outside.
> > > > 
> > > 
> > > When I introduce the multithreading and the multiqueue, as the thread is
> > > driven by the epollfd, the events are managed by the multiqueue part and the
> > > epollfd by the multithread part.
> > > 
> > > The "threading" worker is:
> > > 
> > > static void *threading_worker(void *opaque)
> > > {
> > >          struct threading_context *tc = opaque;
> > > 
> > >          while (true) {
> > >                  struct epoll_event events[NUM_EPOLL_EVENTS];
> > >                  int nfds;
> > > 
> > >                  nfds = epoll_wait(tc->epollfd, events, NUM_EPOLL_EVENTS,
> > >                                    TIMER_INTERVAL);
> > >                  if (nfds == -1 && errno != EINTR)
> > >                          die_perror("epoll_wait() failed");
> > > 
> > >                  tc->worker(tc->opaque, nfds, events);
> > 
> > IIUC the point here is that eventually the epoll_wait() will be
> > common, but the worker might be different for different threads.  Is
> > that correct?
> > 
> 
> Yes, we can have the passt_worker() for all threads, but we can also have a
> specific worker for passt main (with netlink, listen, ...), a specific for
> TX vhost-user (with kickfd), and another one for RX (with all the ICMP, UDP,
> TCP sockets).

Ok.  But do the different workers need to be different in their code,
rather than just what fds end up in their epoll pool?  I can see that
there might be some worthwhile improvements to be had by specialising
the worker functions.  On the other hand, if we keep the worker loop
generic, it would make it easier to refine our workload balancing by
moving different fds between the threads.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2025-10-23  1:42 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-17 10:31 [PATCH v4 0/7] Refactor epoll handling in preparation for multithreading Laurent Vivier
2025-10-17 10:31 ` [PATCH v4 1/7] util: Simplify epoll_del() interface to take epollfd directly Laurent Vivier
2025-10-17 10:31 ` [PATCH v4 2/7] epoll_ctl: Extract epoll operations Laurent Vivier
2025-10-17 11:48   ` Stefano Brivio
2025-10-17 12:21     ` Laurent Vivier
2025-10-17 13:05       ` Stefano Brivio
2025-10-20  1:20   ` David Gibson
2025-10-21 11:52     ` Laurent Vivier
2025-10-22  0:58       ` David Gibson
2025-10-17 10:31 ` [PATCH v4 3/7] util: Move epoll registration out of sock_l4_sa() Laurent Vivier
2025-10-17 10:31 ` [PATCH v4 4/7] tcp, flow: Replace per-connection in_epoll flag with threadnb in flow_common Laurent Vivier
2025-10-17 17:43   ` Stefano Brivio
2025-10-21 13:13     ` Laurent Vivier
2025-10-20  1:34   ` David Gibson
2025-10-21 12:14     ` Laurent Vivier
2025-10-22  1:00       ` David Gibson
2025-10-17 10:31 ` [PATCH v4 5/7] icmp: Use thread-based epoll management for ICMP flows Laurent Vivier
2025-10-20  1:35   ` David Gibson
2025-10-17 10:31 ` [PATCH v4 6/7] udp: Use thread-based epoll management for UDP flows Laurent Vivier
2025-10-20  1:39   ` David Gibson
2025-10-17 10:31 ` [PATCH v4 7/7] passt: Move main event loop processing into passt_worker() Laurent Vivier
2025-10-17 17:43   ` Stefano Brivio
2025-10-20  1:43   ` David Gibson
2025-10-21  8:00     ` Laurent Vivier
2025-10-22  0:53       ` David Gibson
2025-10-22  6:49         ` Laurent Vivier
2025-10-23  1:24           ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).