From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202606 header.b=hwZ3HbCu; dkim-atps=neutral Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 71B255A0265 for ; Fri, 19 Jun 2026 08:52:21 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202606; t=1781851937; bh=GWa45nrN1HvEjS5R5skqj4hpM27On3rMat+IqTC8VDM=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=hwZ3HbCuZyeQ4lvxkw9EVYCI2kA12lWHyjSXPOtXbpdeJb0pRPsKa2ZPbIzaH+M3F qk92nE7bsd3Fu+7RUNkwUbP3VtwYqJkz5NDRnpVIJzJH4lANXSMjuCAEp/9UEKwMiy wyTchBhR9mtcIsCe390LyUZtynxx3bY2am2CUhLl/xVclQggZ5+0F6mdD3UxyUMGon zVFkEvkgVrDKy+N60BQBTkfIOh1ZhtrS78f4fmu3IRH1B0d9h5MWTxsgM4F7h75QMD C4y0tdFsuCN6SQxSz0ceC6LPX4gOCzIBgDZwyGKxz3JNOd9SPMCBFlmiEPsdslLGX7 XtBG47zwBAN/w== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4ghSvP6VpPz58t0; Fri, 19 Jun 2026 16:52:17 +1000 (AEST) Date: Fri, 19 Jun 2026 16:52:12 +1000 From: David Gibson To: Laurent Vivier Subject: Re: [PATCH v5 12/12] flow: Derive epoll fd from queue pair, removing epollid field Message-ID: References: <20260616125130.1324274-1-lvivier@redhat.com> <20260616125130.1324274-13-lvivier@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="OdIPjbbGYvS1wKgy" Content-Disposition: inline In-Reply-To: <20260616125130.1324274-13-lvivier@redhat.com> Message-ID-Hash: 7TRZDY2BFKBDPHLOIZBJFGJ6DR7OFCJ4 X-Message-ID-Hash: 7TRZDY2BFKBDPHLOIZBJFGJ6DR7OFCJ4 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --OdIPjbbGYvS1wKgy Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Jun 16, 2026 at 02:51:30PM +0200, Laurent Vivier wrote: > Since each queue pair maps to exactly one epoll instance, the epoll > file descriptor can be looked up directly from the qpair field. This > makes the separate epollid field in flow_common redundant. >=20 > Replace epoll_id_to_fd[] with qpair_to_fd[], remove > flow_epollid_set(), flow_epollid_register(), flow_qp()/FLOW_QP(), > and the epollid field from flow_common. FLOW_QPAIR_INVALID is no > longer needed: newly allocated flows get qpair 0 from memset. It's not obvious to me why it was needed before, but not now. > For new flows, FLOW_SETQP() sets the queue pair during creation, and > the socket is registered with epoll separately via flow_epoll_set(). >=20 > For existing flows that may move between queue pairs, add > flow_migrate()/FLOW_MIGRATE(), which removes the socket from the old > epoll instance and re-registers it on the new one. >=20 > TCP timers are migrated lazily: tcp_timer_handler() detects a qpair > mismatch when a timer fires, moves the timerfd to the correct epoll > instance, and returns without further processing. >=20 > flow_init() now takes the execution context to initialise > qpair_to_fd[]. >=20 > Signed-off-by: Laurent Vivier > --- > flow.c | 63 ++++++++++++++++++++-------------------------------- > flow.h | 25 ++++++++------------- > icmp.c | 3 +-- > passt.c | 3 +-- > tcp.c | 42 +++++++++++++++++++++++++---------- > tcp_splice.c | 1 - > udp_flow.c | 4 +--- > 7 files changed, 66 insertions(+), 75 deletions(-) >=20 > diff --git a/flow.c b/flow.c > index bf855fe0dfaf..787a7139cfc1 100644 > --- a/flow.c > +++ b/flow.c > @@ -130,7 +130,7 @@ static_assert(ARRAY_SIZE(flow_epoll) =3D=3D FLOW_NUM_= TYPES, > unsigned flow_first_free; > union flow flowtab[FLOW_MAX]; > static const union flow *flow_new_entry; /* =3D NULL */ > -static int epoll_id_to_fd[EPOLLFD_ID_SIZE]; > +int qpair_to_fd[FLOW_QPAIR_SIZE]; > =20 > /* Hash table to index it */ > #define FLOW_HASH_LOAD 70 /* % */ > @@ -362,19 +362,7 @@ static void flow_set_state(struct flow_common *f, en= um flow_state state) > */ > int flow_epollfd(const struct flow_common *f) > { > - return epoll_id_to_fd[f->epollid]; > -} > - > -/** > - * flow_epollid_set() - Associate a flow with an epoll id > - * @f: Flow to update > - * @epollid: epoll id to associate with this flow > - */ > -void flow_epollid_set(struct flow_common *f, int epollid) > -{ > - assert(epollid < EPOLLFD_ID_SIZE); > - > - f->epollid =3D epollid; > + return qpair_to_fd[f->qpair]; > } > =20 > /** > @@ -404,40 +392,31 @@ int flow_epoll_set(const struct flow_common *f, int= command, uint32_t events, > } > =20 > /** > - * flow_epollid_register() - Initialize the epoll id -> fd mapping > - * @epollid: epoll id to associate to > - * @epollfd: epoll file descriptor for this epoll id > + * flow_setqp() - Set queue pair assignment for a flow > + * @f: Flow to update > + * @qpair: Queue pair number to assign > */ > -void flow_epollid_register(int epollid, int epollfd) > +void flow_setqp(struct flow_common *f, unsigned int qpair) > { > - assert(epollid < EPOLLFD_ID_SIZE); > + assert(qpair < FLOW_QPAIR_SIZE); > - epoll_id_to_fd[epollid] =3D epollfd; > -} > + flow_trace((union flow *)f, "setting queue pair to %d", qpair); > =20 > -/** > - * flow_qp() - Get the queue pair for a flow > - * @f: Flow to query (may be NULL) > - * > - * Return: queue pair number for the flow, or 0 if flow is NULL or has no > - * valid queue pair assignment > - */ > -/* cppcheck-suppress unusedFunction */ > -unsigned int flow_qp(const struct flow_common *f) > -{ > - if (f =3D=3D NULL || f->qpair =3D=3D FLOW_QPAIR_INVALID) > - return QPAIR_DEFAULT; > - return f->qpair; > + f->qpair =3D qpair; > } > =20 > /** > - * flow_setqp() - Set queue pair assignment for a flow > + * flow_migrate() - Migrate a flow to a different queue pair > * @f: Flow to update > * @qpair: Queue pair number to assign > + * @events: epoll events to watch for > + * @fd: File descriptor to register > + * @sidei: Side index of the flow > */ > -void flow_setqp(struct flow_common *f, unsigned int qpair) > +void flow_migrate(struct flow_common *f, unsigned int qpair, uint32_t ev= ents, > + int fd, unsigned int sidei) > { > - assert(qpair < FLOW_QPAIR_MAX); > + assert(qpair < FLOW_QPAIR_SIZE); > =20 > if (f->qpair =3D=3D qpair) > return; > @@ -445,7 +424,10 @@ void flow_setqp(struct flow_common *f, unsigned int = qpair) > flow_trace((union flow *)f, "updating queue pair from %d to %d", > f->qpair, qpair); > =20 > + epoll_del(qpair_to_fd[f->qpair], fd); > + > f->qpair =3D qpair; > + flow_epoll_set(f, EPOLL_CTL_ADD, events, fd, sidei); > } > =20 > /** > @@ -669,7 +651,6 @@ union flow *flow_alloc(void) > =20 > flow_new_entry =3D flow; > memset(flow, 0, sizeof(*flow)); > - flow->f.qpair =3D FLOW_QPAIR_INVALID; > flow_set_state(&flow->f, FLOW_STATE_NEW); > =20 > return flow; > @@ -1295,8 +1276,9 @@ int flow_migrate_target(struct ctx *c, const struct= migrate_stage *stage, > =20 > /** > * flow_init() - Initialise flow related data structures > + * @c: Execution context > */ > -void flow_init(void) > +void flow_init(const struct ctx *c) > { > unsigned b; > =20 > @@ -1306,4 +1288,7 @@ void flow_init(void) > =20 > for (b =3D 0; b < FLOW_HASH_SIZE; b++) > flow_hashtab[b] =3D FLOW_SIDX_NONE; > + > + for (b =3D 0; b < FLOW_QPAIR_SIZE; b++) > + qpair_to_fd[b] =3D c->epollfd; > } > diff --git a/flow.h b/flow.h > index 3c74dcbd95c4..53e0408a9ee5 100644 > --- a/flow.h > +++ b/flow.h > @@ -157,6 +157,8 @@ struct flowside { > in_port_t eport; > }; > =20 > +extern int qpair_to_fd[]; > + > /** > * flowside_eq() - Check if two flowsides are equal > * @left, @right: Flowsides to compare > @@ -204,20 +206,12 @@ struct flow_common { > =20 > uint8_t tap_omac[6]; > =20 > -#define EPOLLFD_ID_BITS 8 > - unsigned int epollid:EPOLLFD_ID_BITS; > #define FLOW_QPAIR_BITS 5 > unsigned int qpair:FLOW_QPAIR_BITS; > }; > =20 > -#define EPOLLFD_ID_DEFAULT 0 > -#define EPOLLFD_ID_SIZE (1 << EPOLLFD_ID_BITS) > - > -#define FLOW_QPAIR_NUM (1 << FLOW_QPAIR_BITS) > -#define FLOW_QPAIR_MAX (FLOW_QPAIR_NUM - 1) > -#define FLOW_QPAIR_INVALID FLOW_QPAIR_MAX > - > -static_assert(VHOST_USER_MAX_VQS <=3D FLOW_QPAIR_MAX * 2); > +#define FLOW_QPAIR_SIZE (1 << FLOW_QPAIR_BITS) > +static_assert(VHOST_USER_MAX_VQS <=3D FLOW_QPAIR_SIZE * 2); > =20 > #define FLOW_INDEX_BITS 17 /* 128k - 1 */ > #define FLOW_MAX MAX_FROM_BITS(FLOW_INDEX_BITS) > @@ -273,18 +267,17 @@ flow_sidx_t flow_lookup_sa(const struct ctx *c, uin= t8_t proto, uint8_t pif, > =20 > union flow; > =20 > -void flow_init(void); > +void flow_init(const struct ctx *c); > int flow_epollfd(const struct flow_common *f); > -void flow_epollid_set(struct flow_common *f, int epollid); > int flow_epoll_set(const struct flow_common *f, int command, uint32_t ev= ents, > int fd, unsigned int sidei); > -void flow_epollid_register(int epollid, int epollfd); > -unsigned int flow_qp(const struct flow_common *f); > -#define FLOW_QP(flow_) \ > - (flow_qp(&(flow_)->f)) > void flow_setqp(struct flow_common *f, unsigned int qpair); > #define FLOW_SETQP(flow_, _qpair) \ > (flow_setqp(&(flow_)->f, _qpair)) > +void flow_migrate(struct flow_common *f, unsigned int qpair, uint32_t ev= ents, > + int fd, unsigned int sidei); > +#define FLOW_MIGRATE(flow_, qpair_, events_, fd_, sidei_) \ > + (flow_migrate(&(flow_)->f, qpair_, events_, fd_, sidei_)) > =20 > void flow_defer_handler(const struct ctx *c, const struct timespec *now, > unsigned int qpair); > diff --git a/icmp.c b/icmp.c > index 2558fe5beaab..98ce55a8aff0 100644 > --- a/icmp.c > +++ b/icmp.c > @@ -216,7 +216,6 @@ static struct icmp_ping_flow *icmp_ping_new(const str= uct ctx *c, > goto cancel; > =20 > FLOW_SETQP(pingf, qpair); Having both FLOW_SETQP() and FLOW_MIGRATE() seems dangerous to me. I think setting the initial value in flow_initiate_() or similar, then only changing it via FLOW_MIGRATE() seems like a better idea. > - flow_epollid_set(&pingf->f, EPOLLFD_ID_DEFAULT); > if (flow_epoll_set(&pingf->f, EPOLL_CTL_ADD, EPOLLIN, pingf->sock, > TGTSIDE) < 0) { > close(pingf->sock); > @@ -307,7 +306,7 @@ int icmp_tap_handler(const struct ctx *c, unsigned in= t qpair, uint8_t pif, > =20 > if (flow) { > pingf =3D &flow->ping; > - FLOW_SETQP(pingf, qpair); /* XXX if qpair change, update epollfd */ > + FLOW_MIGRATE(pingf, qpair, EPOLLIN, pingf->sock, TGTSIDE); > } else if (!(pingf =3D icmp_ping_new(c, qpair, af, id, saddr, daddr))) { > return 1; > } > diff --git a/passt.c b/passt.c > index c9e456641e85..3afc59b19120 100644 > --- a/passt.c > +++ b/passt.c > @@ -365,7 +365,6 @@ int main(int argc, char **argv) > c.epollfd =3D epoll_create1(EPOLL_CLOEXEC); > if (c.epollfd =3D=3D -1) > die_perror("Failed to create epoll file descriptor"); > - flow_epollid_register(EPOLLFD_ID_DEFAULT, c.epollfd); > =20 > if (getrlimit(RLIMIT_NOFILE, &limit)) > die_perror("Failed to get maximum value of open files limit"); > @@ -388,7 +387,7 @@ int main(int argc, char **argv) > if (clock_gettime(CLOCK_MONOTONIC, &now)) > die_perror("Failed to get CLOCK_MONOTONIC time"); > =20 > - flow_init(); > + flow_init(&c); > fwd_scan_ports_init(&c); > =20 > if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c))) > diff --git a/tcp.c b/tcp.c > index c0a4de33f068..1549e14adaf4 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -541,6 +541,21 @@ static int tcp_epoll_ctl(struct tcp_tap_conn *conn) > return 0; > } > =20 > +static int tcp_timer_epoll_add(struct tcp_tap_conn *conn, int fd) > +{ > + union epoll_ref ref; > + > + ref.type =3D EPOLL_TYPE_TCP_TIMER; > + ref.flow =3D FLOW_IDX(conn); > + ref.fd =3D fd; > + if (epoll_add(flow_epollfd(&conn->f), EPOLLIN | EPOLLET, ref) < 0) { > + flow_dbg(conn, "failed to add timer"); > + return -1; > + } > + > + return 0; > +} > + > /** > * tcp_timer_ctl() - Set timerfd based on flags/events, create timerfd i= f needed > * @c: Execution context > @@ -555,7 +570,6 @@ static void tcp_timer_ctl(const struct ctx *c, struct= tcp_tap_conn *conn) > return; > =20 > if (conn->timer =3D=3D -1) { > - union epoll_ref ref; > int fd; > =20 > fd =3D timerfd_create(CLOCK_MONOTONIC, 0); > @@ -570,12 +584,7 @@ static void tcp_timer_ctl(const struct ctx *c, struc= t tcp_tap_conn *conn) > return; > } > =20 > - ref.type =3D EPOLL_TYPE_TCP_TIMER; > - ref.flow =3D FLOW_IDX(conn); > - ref.fd =3D fd; > - if (epoll_add(flow_epollfd(&conn->f), EPOLLIN | EPOLLET, > - ref) < 0) { > - flow_dbg(conn, "failed to add timer"); > + if (tcp_timer_epoll_add(conn, fd) < 0) { > close(fd); > return; > } > @@ -1736,7 +1745,6 @@ static void tcp_conn_from_tap(const struct ctx *c, = unsigned int qpair, > conn->sock =3D s; > conn->timer =3D -1; > FLOW_SETQP(conn, qpair); > - flow_epollid_set(&conn->f, EPOLLFD_ID_DEFAULT); > if (flow_epoll_set(&conn->f, EPOLL_CTL_ADD, 0, s, TGTSIDE) < 0) { > flow_perror(flow, "Can't register with epoll"); > goto cancel; > @@ -2315,8 +2323,9 @@ int tcp_tap_handler(const struct ctx *c, unsigned i= nt qpair, uint8_t pif, > assert(pif_at_sidx(sidx) =3D=3D PIF_TAP); > conn =3D &flow->tcp; > =20 > - /* update queue pair */ > - FLOW_SETQP(flow, qpair); > + FLOW_MIGRATE(flow, qpair, > + tcp_conn_epoll_events(conn->events, conn->flags), > + conn->sock, !TAPSIDE(conn)); Doesn't the timer fd also need to be migrated to the matching qpair/epollfd? > =20 > flow_trace(conn, "packet length %zu from tap", l4len); > =20 > @@ -2523,7 +2532,6 @@ static void tcp_tap_conn_from_sock(const struct ctx= *c, union flow *flow, > conn->ws_to_tap =3D conn->ws_from_tap =3D 0; > =20 > FLOW_SETQP(conn, QPAIR_DEFAULT); > - flow_epollid_set(&conn->f, EPOLLFD_ID_DEFAULT); > if (flow_epoll_set(&conn->f, EPOLL_CTL_ADD, 0, s, INISIDE) < 0) { > flow_perror(flow, "Can't register with epoll"); > conn_flag(c, conn, CLOSING); > @@ -2646,6 +2654,17 @@ void tcp_timer_handler(const struct ctx *c, union = epoll_ref ref, > assert(!c->no_tcp); > assert(conn->f.type =3D=3D FLOW_TCP); > =20 > + if (conn->f.qpair !=3D qpair) { > + int old_epollfd =3D qpair_to_fd[qpair]; > + > + epoll_del(old_epollfd, conn->timer); > + if (tcp_timer_epoll_add(conn, conn->timer) < 0) { > + close(conn->timer); > + conn->timer =3D -1; > + } Oh, I see, it's done lazily. Why? Also, don't you still need to actually process the current timer event as well as migrate it for future events? > + return; > + } > + > /* We don't reset timers on ~ACK_FROM_TAP_DUE, ~ACK_TO_TAP_DUE. If the > * timer is currently armed, this event came from a previous setting, > * and we just set the timer to a new point in the future: discard it. > @@ -3853,7 +3872,6 @@ int tcp_flow_migrate_target(struct ctx *c, int fd) > } > =20 > FLOW_SETQP(conn, QPAIR_DEFAULT); > - flow_epollid_set(&conn->f, EPOLLFD_ID_DEFAULT); > if (flow_epoll_set(&conn->f, EPOLL_CTL_ADD, 0, conn->sock, > !TAPSIDE(conn))) > goto out; /* tcp_flow_migrate_target_ext() will clean this up */ > diff --git a/tcp_splice.c b/tcp_splice.c > index 1a77ac2e8a18..3215337d3e62 100644 > --- a/tcp_splice.c > +++ b/tcp_splice.c > @@ -378,7 +378,6 @@ static int tcp_splice_connect(const struct ctx *c, st= ruct tcp_splice_conn *conn) > pif_sockaddr(c, &sa, tgtpif, &tgt->eaddr, tgt->eport); > =20 > FLOW_SETQP(conn, QPAIR_DEFAULT); > - flow_epollid_set(&conn->f, EPOLLFD_ID_DEFAULT); > if (flow_epoll_set(&conn->f, EPOLL_CTL_ADD, 0, conn->s[0], 0) || > flow_epoll_set(&conn->f, EPOLL_CTL_ADD, 0, conn->s[1], 1)) { > int ret =3D -errno; > diff --git a/udp_flow.c b/udp_flow.c > index 44e0c4c50ca9..27dc24ffb2ae 100644 > --- a/udp_flow.c > +++ b/udp_flow.c > @@ -154,7 +154,6 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, = unsigned int qpair, > uflow->activity[INISIDE] =3D 1; > uflow->activity[TGTSIDE] =3D 0; > FLOW_SETQP(uflow, qpair); > - flow_epollid_set(&uflow->f, EPOLLFD_ID_DEFAULT); > =20 > flow_foreach_sidei(sidei) { > if (pif_is_socket(uflow->f.pif[sidei])) > @@ -292,8 +291,7 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c, un= signed int qpair, > srcport, dstport); > if ((uflow =3D udp_at_sidx(sidx))) { > udp_flow_activity(uflow, sidx.sidei, now); > - /* update qpair */ > - FLOW_SETQP(uflow, qpair); /* if qpair changes, update epollfd */ > + FLOW_MIGRATE(uflow, qpair, EPOLLIN, uflow->s[TGTSIDE], TGTSIDE); > return flow_sidx_opposite(sidx); > } > =20 > --=20 > 2.54.0 >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --OdIPjbbGYvS1wKgy Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmo05w4ACgkQzQJF27ox 2GdJDBAAhQ2FX08lH0MEG2re1gG7nK90i7vlBaCoBE6htxMnY93JjADfhjHRXK4H e25mRW2WCgNUZqIZgHgy1XcdnXDy9U0mRerfkTW3pwd0j51WC+UpuvGfeGqhZTat cfj7yBpLBRaONdZukeA9ZW9VAE17QJnxxypWH932pePEInXGHGLy9ns+6qwDDLza 9fLs3kH4OaQlF75N6Y6WfzSpqldwnIK7oXnGTsKAOUW3eHpMUc6izW3bi4+A+XI6 +wTtGG1QiPHr2Jut7gA/5JZ7sjs8VGxuWVojQ4JXDvLLcWWt+y/8k5o+3TygK4oY I9CiUl9hPIPdMQ4iS8VnQ9ljdAj0M3IgpMuExNjj1cwRotEC1+2e057ALrUIn57H fDpHE2/FOEyYDBCdNTEmtr4BiGRV9k5OKkzrrbgQ6KkKW3lIhA44nO0MurfqzNY/ SR7w4RDaJ/o5L4qluABRSCDhKdtVuXyZil82lOqS+m974d9RAr/4i3kIX3DHK+GJ dcU8iETNzr8SMy2rVc8NnxLGv86MrJ7gQ5Z7sq5DkgFOhZhpQ8rDVdOVlQQ2suWI 2qjQcoQIAb8hllfw8ykjEkjQm94VOWeoI2W3GCDaP4uqFszhiK/Y9GqU1z0AxD/6 zYLEfHuXwQNHoGFaR8hueDFSYokiJwgag5vLDDiLFg2LMdcAsl8= =qVHK -----END PGP SIGNATURE----- --OdIPjbbGYvS1wKgy--