public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Laurent Vivier <lvivier@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 5/7] flow: Use epoll_id_to_fd[EPOLLFD_ID_DEFAULT] in flow_epollid_set()
Date: Sat, 27 Dec 2025 14:46:12 +1100	[thread overview]
Message-ID: <aU9WhOTS55oiOvBE@zatzit> (raw)
In-Reply-To: <20251219164518.930012-6-lvivier@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 11699 bytes --]

On Fri, Dec 19, 2025 at 05:45:16PM +0100, Laurent Vivier wrote:
> In tcp_epoll_ctl() and tcp_splice_epoll_ctl(), flow_epollid_set() is
> called after flow_epoll_set() to set the epollid to EPOLLFD_ID_DEFAULT.
> When updating an existing flow (EPOLL_CTL_MOD), flow_epoll_set() already
> retrieves the correct epoll fd via flow_epollfd().  For new additions
> (EPOLL_CTL_ADD), we can use epoll_id_to_fd[EPOLLFD_ID_DEFAULT] directly
> instead of c->epollfd, since that's the epollid that will be set.

This comment is kind of pre-existing and kind of more related to the
previous patch.  But, the way we handle 'in_epoll' versus
EPOLL_CTL_ADD and EPOLL_CTL_MOD has always bothered me.  We're using a
bit in the flow structure to make exactly one transition and affect
exactly one call to epoll_ctl().

Moreover the semantics of in_epoll are kind of confusing.  It assumes
all fds of the flow are synchronized in being in/not-in the epoll.
That's basically true, but it's a somewhat non-obvious constraint.
Worse, it doesn't truly track the "not in" state - we don't clear it
when we remove the fd from epoll at the end of the flow's lifetime:
and we don't want to (just) do that, because that would mean we a
stray call to flow_epoll_set() would incorrectly re-add the fd of a
dying flow to the epoll.  This last is (part of) what's going on in
bug 165.

I've thought for a while we might be better off to always keep fds in
the epoll for the full flow lifecycle: we add them as part of the flow
and socket creation path, remove them on close.  In between we always
use EPOLL_CTL_MOD.  That removes the logical need for the in_epoll
bit.  You'd still need the epollfd for multiqueue handling of course,
so it wouldn't actually save any space.  Still, I'm beginning to think
making that change might simplify what you do in this series enough to
be worth it.

If we do need to "quiesce" fds for a while (I'm not sure we do at
present), we can set them to an empty event mask, rather than entirely
remove them from the epoll.  That does mean we need to be prepared for
EPOLLERR events (which can;t be masked) at any time - but that strikes
me as probably a good idea, anyway.

> This removes the need for the ctx parameter from flow_epoll_set() and
> simplifies the API.  The change cascades through all callers: tcp.c,
> icmp.c, udp_flow.c, and tcp_splice.c.
> 
> In tcp_splice.c, removing ctx from flow_epoll_set() calls allows us to
> also remove the ctx parameter from tcp_splice_epoll_ctl() and
> conn_event_do(), further simplifying the splice connection handling.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Comments above might change the context.   But in the context of
what's already here, this change makes sense to me, so,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  flow.c       |  8 +++-----
>  flow.h       |  5 ++---
>  icmp.c       |  2 +-
>  tcp.c        |  6 +++---
>  tcp_splice.c | 32 ++++++++++++++------------------
>  udp_flow.c   |  2 +-
>  6 files changed, 24 insertions(+), 31 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index 9c98102e3616..a3f75015a078 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -393,7 +393,6 @@ void flow_epollid_clear(struct flow_common *f)
>  
>  /**
>   * flow_epoll_set() - Add or modify epoll registration for a flow socket
> - * @c:		Execution context
>   * @type:	epoll type
>   * @f:		Flow to register socket for
>   * @events:	epoll events to watch for
> @@ -402,9 +401,8 @@ void flow_epollid_clear(struct flow_common *f)
>   *
>   * Return: 0 on success, -1 on error (from epoll_ctl())
>   */
> -int flow_epoll_set(const struct ctx *c, enum epoll_type type,
> -		   const struct flow_common *f, uint32_t events, int fd,
> -		   unsigned int sidei)
> +int flow_epoll_set(enum epoll_type type, const struct flow_common *f,
> +		   uint32_t events, int fd, unsigned int sidei)
>  {
>  	struct epoll_event ev;
>  	union epoll_ref ref;
> @@ -423,7 +421,7 @@ int flow_epoll_set(const struct ctx *c, enum epoll_type type,
>  			epollfd = flow_epollfd(f);
>  		} else {
>  			m = EPOLL_CTL_ADD;
> -			epollfd = c->epollfd;
> +			epollfd = epoll_id_to_fd[EPOLLFD_ID_DEFAULT];
>  		}
>  		break;
>  	case EPOLL_TYPE_TCP_TIMER: {
> diff --git a/flow.h b/flow.h
> index 388fab124238..9740ede8963e 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -265,9 +265,8 @@ bool flow_in_epoll(const struct flow_common *f);
>  int flow_epollfd(const struct flow_common *f);
>  void flow_epollid_set(struct flow_common *f, int epollid);
>  void flow_epollid_clear(struct flow_common *f);
> -int flow_epoll_set(const struct ctx *c, enum epoll_type type,
> -		   const struct flow_common *f, uint32_t events, int fd,
> -		   unsigned int sidei);
> +int flow_epoll_set(enum epoll_type type, const struct flow_common *f,
> +		   uint32_t events, int fd, unsigned int sidei);
>  void flow_epollid_register(int epollid, int epollfd);
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
>  int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> diff --git a/icmp.c b/icmp.c
> index 235759567060..f0b6fcff1776 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -210,7 +210,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
>  		goto cancel;
>  
>  	flow_epollid_set(&pingf->f, EPOLLFD_ID_DEFAULT);
> -	if (flow_epoll_set(c, EPOLL_TYPE_PING, &pingf->f, EPOLLIN, pingf->sock,
> +	if (flow_epoll_set(EPOLL_TYPE_PING, &pingf->f, EPOLLIN, pingf->sock,
>  			   TGTSIDE) < 0) {
>  		close(pingf->sock);
>  		goto cancel;
> diff --git a/tcp.c b/tcp.c
> index 2907af282551..ec4f6d2ecc04 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -543,14 +543,14 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  
>  	events = tcp_conn_epoll_events(conn->events, conn->flags);
>  
> -	if (flow_epoll_set(c, EPOLL_TYPE_TCP, &conn->f, events, conn->sock,
> +	if (flow_epoll_set(EPOLL_TYPE_TCP, &conn->f, events, conn->sock,
>  			   !TAPSIDE(conn)))
>  		return -errno;
>  
>  	flow_epollid_set(&conn->f, EPOLLFD_ID_DEFAULT);
>  
>  	if (conn->timer != -1) {
> -		if (flow_epoll_set(c, EPOLL_TYPE_TCP_TIMER, &conn->f,
> +		if (flow_epoll_set(EPOLL_TYPE_TCP_TIMER, &conn->f,
>  				   EPOLLIN | EPOLLET, conn->timer, 0))
>  			return -errno;
>  	}
> @@ -582,7 +582,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  			return;
>  		}
>  
> -		if (flow_epoll_set(c, EPOLL_TYPE_TCP_TIMER, &conn->f,
> +		if (flow_epoll_set(EPOLL_TYPE_TCP_TIMER, &conn->f,
>  				   EPOLLIN | EPOLLET, fd, 0)) {
>  			flow_dbg_perror(conn, "failed to add timer");
>  			close(fd);
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 7367f435367b..ccf627f4427d 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -135,22 +135,20 @@ static uint32_t tcp_splice_conn_epoll_events(uint16_t events, unsigned sidei)
>  
>  /**
>   * tcp_splice_epoll_ctl() - Add/modify/delete epoll state from connection events
> - * @c:		Execution context
>   * @conn:	Connection pointer
>   *
>   * Return: 0 on success, negative error code on failure (not on deletion)
>   */
> -static int tcp_splice_epoll_ctl(const struct ctx *c,
> -				struct tcp_splice_conn *conn)
> +static int tcp_splice_epoll_ctl(struct tcp_splice_conn *conn)
>  {
>  	uint32_t events[2];
>  
>  	events[0] = tcp_splice_conn_epoll_events(conn->events, 0);
>  	events[1] = tcp_splice_conn_epoll_events(conn->events, 1);
>  
> -	if (flow_epoll_set(c, EPOLL_TYPE_TCP_SPLICE, &conn->f, events[0],
> +	if (flow_epoll_set(EPOLL_TYPE_TCP_SPLICE, &conn->f, events[0],
>  			   conn->s[0], 0) ||
> -	    flow_epoll_set(c, EPOLL_TYPE_TCP_SPLICE, &conn->f, events[1],
> +	    flow_epoll_set(EPOLL_TYPE_TCP_SPLICE, &conn->f, events[1],
>  			   conn->s[1], 1)) {
>  		int ret = -errno;
>  		flow_perror(conn, "ERROR on epoll_ctl()");
> @@ -204,12 +202,10 @@ static void conn_flag_do(struct tcp_splice_conn *conn,
>  
>  /**
>   * conn_event_do() - Set and log connection events, update epoll state
> - * @c:		Execution context
>   * @conn:	Connection pointer
>   * @event:	Connection event
>   */
> -static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
> -			  unsigned long event)
> +static void conn_event_do(struct tcp_splice_conn *conn, unsigned long event)
>  {
>  	if (event & (event - 1)) {
>  		int flag_index = fls(~event);
> @@ -231,14 +227,14 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
>  			flow_dbg(conn, "%s", tcp_splice_event_str[flag_index]);
>  	}
>  
> -	if (tcp_splice_epoll_ctl(c, conn))
> +	if (tcp_splice_epoll_ctl(conn))
>  		conn_flag(c, conn, CLOSING);
>  }
>  
> -#define conn_event(c, conn, event)					\
> +#define conn_event(conn, event)					\
>  	do {								\
>  		flow_trace(conn, "event at %s:%i",__func__, __LINE__);	\
> -		conn_event_do(c, conn, event);				\
> +		conn_event_do(conn, event);				\
>  	} while (0)
>  
>  
> @@ -320,7 +316,7 @@ static int tcp_splice_connect_finish(const struct ctx *c,
>  	}
>  
>  	if (!(conn->events & SPLICE_ESTABLISHED))
> -		conn_event(c, conn, SPLICE_ESTABLISHED);
> +		conn_event(conn, SPLICE_ESTABLISHED);
>  
>  	return 0;
>  }
> @@ -367,7 +363,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
>  
>  	pif_sockaddr(c, &sa, tgtpif, &tgt->eaddr, tgt->eport);
>  
> -	conn_event(c, conn, SPLICE_CONNECT);
> +	conn_event(conn, SPLICE_CONNECT);
>  
>  	if (connect(conn->s[1], &sa.sa, socklen_inany(&sa))) {
>  		if (errno != EINPROGRESS) {
> @@ -376,7 +372,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
>  			return -errno;
>  		}
>  	} else {
> -		conn_event(c, conn, SPLICE_ESTABLISHED);
> +		conn_event(conn, SPLICE_ESTABLISHED);
>  		return tcp_splice_connect_finish(c, conn);
>  	}
>  
> @@ -485,14 +481,14 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
>  
>  	if (events & EPOLLOUT) {
>  		fromsidei = !evsidei;
> -		conn_event(c, conn, ~OUT_WAIT(evsidei));
> +		conn_event(conn, ~OUT_WAIT(evsidei));
>  	} else {
>  		fromsidei = evsidei;
>  	}
>  
>  	if (events & EPOLLRDHUP)
>  		/* For side 0 this is fake, but implied */
> -		conn_event(c, conn, FIN_RCVD(evsidei));
> +		conn_event(conn, FIN_RCVD(evsidei));
>  
>  swap:
>  	eof = 0;
> @@ -574,7 +570,7 @@ retry:
>  			if (conn->read[fromsidei] == conn->written[fromsidei])
>  				break;
>  
> -			conn_event(c, conn, OUT_WAIT(!fromsidei));
> +			conn_event(conn, OUT_WAIT(!fromsidei));
>  			break;
>  		}
>  
> @@ -596,7 +592,7 @@ retry:
>  			if ((conn->events & FIN_RCVD(sidei)) &&
>  			    !(conn->events & FIN_SENT(!sidei))) {
>  				shutdown(conn->s[!sidei], SHUT_WR);
> -				conn_event(c, conn, FIN_SENT(!sidei));
> +				conn_event(conn, FIN_SENT(!sidei));
>  			}
>  		}
>  	}
> diff --git a/udp_flow.c b/udp_flow.c
> index 42f3622d621c..07ff589670c1 100644
> --- a/udp_flow.c
> +++ b/udp_flow.c
> @@ -84,7 +84,7 @@ static int udp_flow_sock(const struct ctx *c,
>  	}
>  
>  	flow_epollid_set(&uflow->f, EPOLLFD_ID_DEFAULT);
> -	rc = flow_epoll_set(c, EPOLL_TYPE_UDP, &uflow->f, EPOLLIN, s, sidei);
> +	rc = flow_epoll_set(EPOLL_TYPE_UDP, &uflow->f, EPOLLIN, s, sidei);
>  	if (rc < 0) {
>  		close(s);
>  		return rc;
> -- 
> 2.51.1
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2025-12-27  3:46 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-19 16:45 [PATCH 0/7] flow: Introduce flow_epoll_set() to centralize epoll operations Laurent Vivier
2025-12-19 16:45 ` [PATCH 1/7] tcp: Update EPOLL_TYPE_TCP_TIMER fd Laurent Vivier
2025-12-20  7:40   ` David Gibson
2025-12-19 16:45 ` [PATCH 2/7] udp_flow: Assign socket to flow inside udp_flow_sock() Laurent Vivier
2025-12-20  9:43   ` David Gibson
2025-12-19 16:45 ` [PATCH 3/7] tcp_splice: Refactor tcp_splice_conn_epoll_events() to per-side computation Laurent Vivier
2025-12-22  5:46   ` David Gibson
2025-12-19 16:45 ` [PATCH 4/7] flow: Introduce flow_epoll_set() to centralize epoll operations Laurent Vivier
2025-12-26  5:33   ` David Gibson
2025-12-19 16:45 ` [PATCH 5/7] flow: Use epoll_id_to_fd[EPOLLFD_ID_DEFAULT] in flow_epollid_set() Laurent Vivier
2025-12-27  3:46   ` David Gibson [this message]
2025-12-19 16:45 ` [PATCH 6/7] flow: Compute epoll events inside flow_epoll_set() Laurent Vivier
2025-12-27  4:05   ` David Gibson
2025-12-19 16:45 ` [PATCH 7/7] flow: Have flow_epoll_set() retrieve file descriptor from flow structure Laurent Vivier
2025-12-27  4:43   ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aU9WhOTS55oiOvBE@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).