On Fri, Dec 19, 2025 at 05:45:16PM +0100, Laurent Vivier wrote: > In tcp_epoll_ctl() and tcp_splice_epoll_ctl(), flow_epollid_set() is > called after flow_epoll_set() to set the epollid to EPOLLFD_ID_DEFAULT. > When updating an existing flow (EPOLL_CTL_MOD), flow_epoll_set() already > retrieves the correct epoll fd via flow_epollfd(). For new additions > (EPOLL_CTL_ADD), we can use epoll_id_to_fd[EPOLLFD_ID_DEFAULT] directly > instead of c->epollfd, since that's the epollid that will be set. This comment is kind of pre-existing and kind of more related to the previous patch. But, the way we handle 'in_epoll' versus EPOLL_CTL_ADD and EPOLL_CTL_MOD has always bothered me. We're using a bit in the flow structure to make exactly one transition and affect exactly one call to epoll_ctl(). Moreover the semantics of in_epoll are kind of confusing. It assumes all fds of the flow are synchronized in being in/not-in the epoll. That's basically true, but it's a somewhat non-obvious constraint. Worse, it doesn't truly track the "not in" state - we don't clear it when we remove the fd from epoll at the end of the flow's lifetime: and we don't want to (just) do that, because that would mean we a stray call to flow_epoll_set() would incorrectly re-add the fd of a dying flow to the epoll. This last is (part of) what's going on in bug 165. I've thought for a while we might be better off to always keep fds in the epoll for the full flow lifecycle: we add them as part of the flow and socket creation path, remove them on close. In between we always use EPOLL_CTL_MOD. That removes the logical need for the in_epoll bit. You'd still need the epollfd for multiqueue handling of course, so it wouldn't actually save any space. Still, I'm beginning to think making that change might simplify what you do in this series enough to be worth it. If we do need to "quiesce" fds for a while (I'm not sure we do at present), we can set them to an empty event mask, rather than entirely remove them from the epoll. That does mean we need to be prepared for EPOLLERR events (which can;t be masked) at any time - but that strikes me as probably a good idea, anyway. > This removes the need for the ctx parameter from flow_epoll_set() and > simplifies the API. The change cascades through all callers: tcp.c, > icmp.c, udp_flow.c, and tcp_splice.c. > > In tcp_splice.c, removing ctx from flow_epoll_set() calls allows us to > also remove the ctx parameter from tcp_splice_epoll_ctl() and > conn_event_do(), further simplifying the splice connection handling. > > Signed-off-by: Laurent Vivier Comments above might change the context. But in the context of what's already here, this change makes sense to me, so, Reviewed-by: David Gibson > --- > flow.c | 8 +++----- > flow.h | 5 ++--- > icmp.c | 2 +- > tcp.c | 6 +++--- > tcp_splice.c | 32 ++++++++++++++------------------ > udp_flow.c | 2 +- > 6 files changed, 24 insertions(+), 31 deletions(-) > > diff --git a/flow.c b/flow.c > index 9c98102e3616..a3f75015a078 100644 > --- a/flow.c > +++ b/flow.c > @@ -393,7 +393,6 @@ void flow_epollid_clear(struct flow_common *f) > > /** > * flow_epoll_set() - Add or modify epoll registration for a flow socket > - * @c: Execution context > * @type: epoll type > * @f: Flow to register socket for > * @events: epoll events to watch for > @@ -402,9 +401,8 @@ void flow_epollid_clear(struct flow_common *f) > * > * Return: 0 on success, -1 on error (from epoll_ctl()) > */ > -int flow_epoll_set(const struct ctx *c, enum epoll_type type, > - const struct flow_common *f, uint32_t events, int fd, > - unsigned int sidei) > +int flow_epoll_set(enum epoll_type type, const struct flow_common *f, > + uint32_t events, int fd, unsigned int sidei) > { > struct epoll_event ev; > union epoll_ref ref; > @@ -423,7 +421,7 @@ int flow_epoll_set(const struct ctx *c, enum epoll_type type, > epollfd = flow_epollfd(f); > } else { > m = EPOLL_CTL_ADD; > - epollfd = c->epollfd; > + epollfd = epoll_id_to_fd[EPOLLFD_ID_DEFAULT]; > } > break; > case EPOLL_TYPE_TCP_TIMER: { > diff --git a/flow.h b/flow.h > index 388fab124238..9740ede8963e 100644 > --- a/flow.h > +++ b/flow.h > @@ -265,9 +265,8 @@ bool flow_in_epoll(const struct flow_common *f); > int flow_epollfd(const struct flow_common *f); > void flow_epollid_set(struct flow_common *f, int epollid); > void flow_epollid_clear(struct flow_common *f); > -int flow_epoll_set(const struct ctx *c, enum epoll_type type, > - const struct flow_common *f, uint32_t events, int fd, > - unsigned int sidei); > +int flow_epoll_set(enum epoll_type type, const struct flow_common *f, > + uint32_t events, int fd, unsigned int sidei); > void flow_epollid_register(int epollid, int epollfd); > void flow_defer_handler(const struct ctx *c, const struct timespec *now); > int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage, > diff --git a/icmp.c b/icmp.c > index 235759567060..f0b6fcff1776 100644 > --- a/icmp.c > +++ b/icmp.c > @@ -210,7 +210,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, > goto cancel; > > flow_epollid_set(&pingf->f, EPOLLFD_ID_DEFAULT); > - if (flow_epoll_set(c, EPOLL_TYPE_PING, &pingf->f, EPOLLIN, pingf->sock, > + if (flow_epoll_set(EPOLL_TYPE_PING, &pingf->f, EPOLLIN, pingf->sock, > TGTSIDE) < 0) { > close(pingf->sock); > goto cancel; > diff --git a/tcp.c b/tcp.c > index 2907af282551..ec4f6d2ecc04 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -543,14 +543,14 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > events = tcp_conn_epoll_events(conn->events, conn->flags); > > - if (flow_epoll_set(c, EPOLL_TYPE_TCP, &conn->f, events, conn->sock, > + if (flow_epoll_set(EPOLL_TYPE_TCP, &conn->f, events, conn->sock, > !TAPSIDE(conn))) > return -errno; > > flow_epollid_set(&conn->f, EPOLLFD_ID_DEFAULT); > > if (conn->timer != -1) { > - if (flow_epoll_set(c, EPOLL_TYPE_TCP_TIMER, &conn->f, > + if (flow_epoll_set(EPOLL_TYPE_TCP_TIMER, &conn->f, > EPOLLIN | EPOLLET, conn->timer, 0)) > return -errno; > } > @@ -582,7 +582,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > return; > } > > - if (flow_epoll_set(c, EPOLL_TYPE_TCP_TIMER, &conn->f, > + if (flow_epoll_set(EPOLL_TYPE_TCP_TIMER, &conn->f, > EPOLLIN | EPOLLET, fd, 0)) { > flow_dbg_perror(conn, "failed to add timer"); > close(fd); > diff --git a/tcp_splice.c b/tcp_splice.c > index 7367f435367b..ccf627f4427d 100644 > --- a/tcp_splice.c > +++ b/tcp_splice.c > @@ -135,22 +135,20 @@ static uint32_t tcp_splice_conn_epoll_events(uint16_t events, unsigned sidei) > > /** > * tcp_splice_epoll_ctl() - Add/modify/delete epoll state from connection events > - * @c: Execution context > * @conn: Connection pointer > * > * Return: 0 on success, negative error code on failure (not on deletion) > */ > -static int tcp_splice_epoll_ctl(const struct ctx *c, > - struct tcp_splice_conn *conn) > +static int tcp_splice_epoll_ctl(struct tcp_splice_conn *conn) > { > uint32_t events[2]; > > events[0] = tcp_splice_conn_epoll_events(conn->events, 0); > events[1] = tcp_splice_conn_epoll_events(conn->events, 1); > > - if (flow_epoll_set(c, EPOLL_TYPE_TCP_SPLICE, &conn->f, events[0], > + if (flow_epoll_set(EPOLL_TYPE_TCP_SPLICE, &conn->f, events[0], > conn->s[0], 0) || > - flow_epoll_set(c, EPOLL_TYPE_TCP_SPLICE, &conn->f, events[1], > + flow_epoll_set(EPOLL_TYPE_TCP_SPLICE, &conn->f, events[1], > conn->s[1], 1)) { > int ret = -errno; > flow_perror(conn, "ERROR on epoll_ctl()"); > @@ -204,12 +202,10 @@ static void conn_flag_do(struct tcp_splice_conn *conn, > > /** > * conn_event_do() - Set and log connection events, update epoll state > - * @c: Execution context > * @conn: Connection pointer > * @event: Connection event > */ > -static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn, > - unsigned long event) > +static void conn_event_do(struct tcp_splice_conn *conn, unsigned long event) > { > if (event & (event - 1)) { > int flag_index = fls(~event); > @@ -231,14 +227,14 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn, > flow_dbg(conn, "%s", tcp_splice_event_str[flag_index]); > } > > - if (tcp_splice_epoll_ctl(c, conn)) > + if (tcp_splice_epoll_ctl(conn)) > conn_flag(c, conn, CLOSING); > } > > -#define conn_event(c, conn, event) \ > +#define conn_event(conn, event) \ > do { \ > flow_trace(conn, "event at %s:%i",__func__, __LINE__); \ > - conn_event_do(c, conn, event); \ > + conn_event_do(conn, event); \ > } while (0) > > > @@ -320,7 +316,7 @@ static int tcp_splice_connect_finish(const struct ctx *c, > } > > if (!(conn->events & SPLICE_ESTABLISHED)) > - conn_event(c, conn, SPLICE_ESTABLISHED); > + conn_event(conn, SPLICE_ESTABLISHED); > > return 0; > } > @@ -367,7 +363,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) > > pif_sockaddr(c, &sa, tgtpif, &tgt->eaddr, tgt->eport); > > - conn_event(c, conn, SPLICE_CONNECT); > + conn_event(conn, SPLICE_CONNECT); > > if (connect(conn->s[1], &sa.sa, socklen_inany(&sa))) { > if (errno != EINPROGRESS) { > @@ -376,7 +372,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) > return -errno; > } > } else { > - conn_event(c, conn, SPLICE_ESTABLISHED); > + conn_event(conn, SPLICE_ESTABLISHED); > return tcp_splice_connect_finish(c, conn); > } > > @@ -485,14 +481,14 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, > > if (events & EPOLLOUT) { > fromsidei = !evsidei; > - conn_event(c, conn, ~OUT_WAIT(evsidei)); > + conn_event(conn, ~OUT_WAIT(evsidei)); > } else { > fromsidei = evsidei; > } > > if (events & EPOLLRDHUP) > /* For side 0 this is fake, but implied */ > - conn_event(c, conn, FIN_RCVD(evsidei)); > + conn_event(conn, FIN_RCVD(evsidei)); > > swap: > eof = 0; > @@ -574,7 +570,7 @@ retry: > if (conn->read[fromsidei] == conn->written[fromsidei]) > break; > > - conn_event(c, conn, OUT_WAIT(!fromsidei)); > + conn_event(conn, OUT_WAIT(!fromsidei)); > break; > } > > @@ -596,7 +592,7 @@ retry: > if ((conn->events & FIN_RCVD(sidei)) && > !(conn->events & FIN_SENT(!sidei))) { > shutdown(conn->s[!sidei], SHUT_WR); > - conn_event(c, conn, FIN_SENT(!sidei)); > + conn_event(conn, FIN_SENT(!sidei)); > } > } > } > diff --git a/udp_flow.c b/udp_flow.c > index 42f3622d621c..07ff589670c1 100644 > --- a/udp_flow.c > +++ b/udp_flow.c > @@ -84,7 +84,7 @@ static int udp_flow_sock(const struct ctx *c, > } > > flow_epollid_set(&uflow->f, EPOLLFD_ID_DEFAULT); > - rc = flow_epoll_set(c, EPOLL_TYPE_UDP, &uflow->f, EPOLLIN, s, sidei); > + rc = flow_epoll_set(EPOLL_TYPE_UDP, &uflow->f, EPOLLIN, s, sidei); > if (rc < 0) { > close(s); > return rc; > -- > 2.51.1 > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson