From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 7 Oct 2025 17:07:37 +1100
From: David Gibson
To: Laurent Vivier
CC: passt-dev@passt.top
Subject: Re: [PATCH 3/5] tcp, flow: Replace per-connection in_epoll flag with epollfd in flow_common
References: <20251003152717.2437765-1-lvivier@redhat.com> <20251003152717.2437765-4-lvivier@redhat.com>
In-Reply-To: <20251003152717.2437765-4-lvivier@redhat.com>
List-Id: Development discussion and patches for passt
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="9IivLi9pR0qNKJtl"
Content-Disposition: inline

--9IivLi9pR0qNKJtl
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Fri, Oct 03, 2025 at 05:27:15PM +0200, Laurent Vivier wrote:
> The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
> whether a connection was registered with epoll, not which epoll instance.
> This limited flexibility for future multi-epoll support.
>
> Replace the boolean with an epollfd field in flow_common that serves dual
> purpose: zero indicates not registered (replacing in_epoll=false), non-zero

Don't use 0 for "not registered", since that's a valid fd.

> stores the actual epoll fd (replacing in_epoll=true).

I am a bit nervous about adding 31 bits to every flow, since I think
we're fairly close to a cacheline threshold.  I'm not sure we really
can add any less to flow_common, though, given alignment.

Then again... we probably don't need 8 bits each for TYPE and STATE,
so those could be packed tighter.  Then we could use a limited-bits
index into a table of epollfds, rather than a raw fd.  Much uglier,
but maybe worth it?

> This change also simplifies tcp_timer_ctl() by removing the need to pass
> the context 'c', since the epoll fd is now directly accessible from the
> connection structure.
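
To make the two remarks above concrete (0 being a valid fd, and a
limited-bits index instead of a raw fd), here is a minimal sketch.
Everything in it is hypothetical and for illustration only: the names
flow_hdr_sketch, epollfd_table and the flow_epollfd*() helpers, as well
as the 3-bit widths, are assumptions and not code from this series or
from passt.  The idea is that one index value is reserved to mean "not
registered", so that fd 0 stays usable, and that unused table slots
hold -1.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names and widths throughout, not taken from passt */
#define EPOLLFD_ID_BITS	3
/* Highest index value is reserved to mean "not registered" */
#define EPOLLFD_ID_NONE	((1U << EPOLLFD_ID_BITS) - 1)

/* Small table of real epoll fds; -1 marks a free slot, as 0 is a valid fd */
static int epollfd_table[EPOLLFD_ID_NONE] = { -1, -1, -1, -1, -1, -1, -1 };
static_assert(EPOLLFD_ID_NONE == 7, "update the epollfd_table initialiser");

/* Stand-in for the start of flow_common, with assumed field widths */
struct flow_hdr_sketch {
	unsigned int state	:3;
	unsigned int type	:3;
	unsigned int epollfd_id	:EPOLLFD_ID_BITS;
};

/* Is the flow registered with any epoll instance? */
bool flow_in_epoll(const struct flow_hdr_sketch *f)
{
	return f->epollfd_id != EPOLLFD_ID_NONE;
}

/* Return the epoll fd the flow is registered with, -1 if none */
int flow_epollfd(const struct flow_hdr_sketch *f)
{
	return flow_in_epoll(f) ? epollfd_table[f->epollfd_id] : -1;
}

/* Record that the flow is registered with @epollfd, 0 on success */
int flow_epollfd_set(struct flow_hdr_sketch *f, int epollfd)
{
	unsigned int i, unused = EPOLLFD_ID_NONE;

	if (epollfd < 0)
		return -1;

	for (i = 0; i < EPOLLFD_ID_NONE; i++) {
		if (epollfd_table[i] == epollfd)
			break;			/* already in the table */
		if (epollfd_table[i] == -1 && unused == EPOLLFD_ID_NONE)
			unused = i;		/* remember first free slot */
	}

	if (i == EPOLLFD_ID_NONE) {
		if (unused == EPOLLFD_ID_NONE)
			return -1;		/* table is full */
		i = unused;
		epollfd_table[i] = epollfd;
	}

	f->epollfd_id = i;
	return 0;
}
```

With something along these lines, a check such as
"conn->f.epollfd ? EPOLL_CTL_MOD : EPOLL_CTL_ADD" would become a call
to flow_in_epoll(), and a flow registered with fd 0 would no longer be
mistaken for an unregistered one.  The cost is an extra table lookup on
each epoll_ctl() call.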
>
> Signed-off-by: Laurent Vivier
> ---
>  flow.c       |  2 +-
>  flow.h       |  2 ++
>  tcp.c        | 36 ++++++++++++++++++------------------
>  tcp_conn.h   |  8 +-------
>  tcp_splice.c | 23 +++++++++++------------
>  5 files changed, 33 insertions(+), 38 deletions(-)
>
> diff --git a/flow.c b/flow.c
> index b14e9d8b63ff..7c61ee87ae9d 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -827,7 +827,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  		case FLOW_TCP_SPLICE:
>  			closed = tcp_splice_flow_defer(&flow->tcp_splice);
>  			if (!closed && timer)
> -				tcp_splice_timer(c, &flow->tcp_splice);
> +				tcp_splice_timer(&flow->tcp_splice);
>  			break;
>  		case FLOW_PING4:
>  		case FLOW_PING6:
> diff --git a/flow.h b/flow.h
> index ef138b83add8..592d9e3792f6 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -175,6 +175,7 @@ int flowside_connect(const struct ctx *c, int s,
>   * struct flow_common - Common fields for packet flows
>   * @state: State of the flow table entry
>   * @type: Type of packet flow
> + * @epollfd: epoll instance flow is registered with (0 if not registered)
>   * @pif[]: Interface for each side of the flow
>   * @side[]: Information for each side of the flow
>   */
> @@ -190,6 +191,7 @@ struct flow_common {
>  	static_assert(sizeof(uint8_t) * 8 >= FLOW_TYPE_BITS,
>  		      "Not enough bits for type field");
>  #endif
> +	int epollfd;

This should go after pif[] - it's a less logical order, but it will
save 2 bytes of alignment padding.

>  	uint8_t pif[SIDES];
>  	struct flowside side[SIDES];
>  };
> diff --git a/tcp.c b/tcp.c
> index 04725deabb65..c995b40f38f8 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -504,25 +504,26 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
>   */
>  static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
> -	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
> +	int m = conn->f.epollfd ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
>  				.flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), };
>  	struct epoll_event ev = { .data.u64 = ref.u64 };
> +	int epollfd = conn->f.epollfd ? conn->f.epollfd : c->epollfd;
>
>  	if (conn->events == CLOSED) {
> -		if (conn->in_epoll)
> -			epoll_del(c->epollfd, conn->sock);
> +		if (conn->f.epollfd)
> +			epoll_del(epollfd, conn->sock);
>  		if (conn->timer != -1)
> -			epoll_del(c->epollfd, conn->timer);
> +			epoll_del(epollfd, conn->timer);
>  		return 0;
>  	}
>
>  	ev.events = tcp_conn_epoll_events(conn->events, conn->flags);
>
> -	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
> +	if (epoll_ctl(epollfd, m, conn->sock, &ev))
>  		return -errno;
>
> -	conn->in_epoll = true;
> +	conn->f.epollfd = epollfd;
>
>  	if (conn->timer != -1) {
>  		union epoll_ref ref_t = { .type = EPOLL_TYPE_TCP_TIMER,
> @@ -531,7 +532,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		struct epoll_event ev_t = { .data.u64 = ref_t.u64,
>  					    .events = EPOLLIN | EPOLLET };
>
> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_MOD, conn->timer, &ev_t))
> +		if (epoll_ctl(conn->f.epollfd, EPOLL_CTL_MOD, conn->timer, &ev_t))
>  			return -errno;
>  	}
>
> @@ -540,12 +541,11 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>
>  /**
>   * tcp_timer_ctl() - Set timerfd based on flags/events, create timerfd if needed
> - * @c: Execution context
>   * @conn: Connection pointer
>   *
>   * #syscalls timerfd_create timerfd_settime
>   */
> -static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> +static void tcp_timer_ctl(struct tcp_tap_conn *conn)
>  {
>  	struct itimerspec it = { { 0 }, { 0 } };
>
> @@ -570,7 +570,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		}
>  		conn->timer = fd;
>
> -		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
> +		if (epoll_ctl(conn->f.epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
>  			flow_dbg_perror(conn, "failed to add timer");
>  			close(conn->timer);
>  			conn->timer = -1;
> @@ -628,7 +628,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  		 * flags and factor this into the logic below.
>  		 */
>  		if (flag == ACK_FROM_TAP_DUE)
> -			tcp_timer_ctl(c, conn);
> +			tcp_timer_ctl(conn);
>
>  		return;
>  	}
> @@ -644,7 +644,7 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  	if (flag == ACK_FROM_TAP_DUE || flag == ACK_TO_TAP_DUE ||
>  	    (flag == ~ACK_FROM_TAP_DUE && (conn->flags & ACK_TO_TAP_DUE)) ||
>  	    (flag == ~ACK_TO_TAP_DUE && (conn->flags & ACK_FROM_TAP_DUE)))
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  }
>
>  /**
> @@ -699,7 +699,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp_epoll_ctl(c, conn);
>
>  	if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  }
>
>  /**
> @@ -1732,7 +1732,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  			   seq, conn->seq_from_tap);
>
>  		tcp_send_flag(c, conn, ACK);
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>
>  		if (p->count == 1) {
>  			tcp_tap_window_update(c, conn,
> @@ -2375,7 +2375,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>
>  	if (conn->flags & ACK_TO_TAP_DUE) {
>  		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> -		tcp_timer_ctl(c, conn);
> +		tcp_timer_ctl(conn);
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		if (!(conn->events & ESTABLISHED)) {
>  			flow_dbg(conn, "handshake timeout");
> @@ -2397,7 +2397,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>  				return;
>
>  			tcp_data_from_sock(c, conn);
> -			tcp_timer_ctl(c, conn);
> +			tcp_timer_ctl(conn);
>  		}
>  	} else {
>  		struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
> @@ -3445,7 +3445,7 @@ int tcp_flow_migrate_source_ext(const struct ctx *c,
>  	if (c->migrate_no_linger)
>  		close(s);
>  	else
> -		epoll_del(c->epollfd, s);
> +		epoll_del(conn->f.epollfd, s);
>
>  	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
>  	 * based on the end of the queues.
> @@ -3594,7 +3594,7 @@ static int tcp_flow_repair_connect(const struct ctx *c,
>  		return rc;
>  	}
>
> -	conn->in_epoll = 0;
> +	conn->f.epollfd = 0;
>  	conn->timer = -1;
>  	conn->listening_sock = -1;
>
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 38b5c541f003..81333122d531 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -12,7 +12,6 @@
>  /**
>   * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
>   * @f: Generic flow information
> - * @in_epoll: Is the connection in the epoll set?
>   * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT
>   * @ws_from_tap: Window scaling factor advertised from tap/guest
>   * @ws_to_tap: Window scaling factor advertised to tap/guest
> @@ -36,8 +35,6 @@ struct tcp_tap_conn {
>  	/* Must be first element */
>  	struct flow_common f;
>
> -	bool in_epoll :1;
> -
>  #define TCP_RETRANS_BITS 3
>  	unsigned int retrans :TCP_RETRANS_BITS;
>  #define TCP_MAX_RETRANS MAX_FROM_BITS(TCP_RETRANS_BITS)
> @@ -196,7 +193,6 @@ struct tcp_tap_transfer_ext {
>   * @written: Bytes written (not fully written from one other side read)
>   * @events: Events observed/actions performed on connection
>   * @flags: Connection flags (attributes, not events)
> - * @in_epoll: Is the connection in the epoll set?
>   */
>  struct tcp_splice_conn {
>  	/* Must be first element */
> @@ -220,8 +216,6 @@ struct tcp_splice_conn {
>  #define RCVLOWAT_SET(sidei_) ((sidei_) ? BIT(1) : BIT(0))
>  #define RCVLOWAT_ACT(sidei_) ((sidei_) ? BIT(3) : BIT(2))
>  #define CLOSING BIT(4)
> -
> -	bool in_epoll :1;
>  };
>
>  /* Socket pools */
> @@ -245,7 +239,7 @@ int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd
>  bool tcp_flow_is_established(const struct tcp_tap_conn *conn);
>
>  bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
> -void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
> +void tcp_splice_timer(struct tcp_splice_conn *conn);
>  int tcp_conn_pool_sock(int pool[]);
>  int tcp_conn_sock(sa_family_t af);
>  int tcp_sock_refill_pool(int pool[], sa_family_t af);
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 666ee62b738f..49fb43473de6 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -149,7 +149,7 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
>  static int tcp_splice_epoll_ctl(const struct ctx *c,
>  				struct tcp_splice_conn *conn)
>  {
> -	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
> +	int m = conn->f.epollfd ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	const union epoll_ref ref[SIDES] = {
>  		{ .type = EPOLL_TYPE_TCP_SPLICE, .fd = conn->s[0],
>  		  .flowside = FLOW_SIDX(conn, 0) },
> @@ -158,28 +158,28 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
>  	};
>  	struct epoll_event ev[SIDES] = { { .data.u64 = ref[0].u64 },
>  					 { .data.u64 = ref[1].u64 } };
> +	int epollfd = conn->f.epollfd ? conn->f.epollfd : c->epollfd;
>
>  	tcp_splice_conn_epoll_events(conn->events, ev);
>
> -	if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) ||
> -	    epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) {
> +
> +	if (epoll_ctl(epollfd, m, conn->s[0], &ev[0]) ||
> +	    epoll_ctl(epollfd, m, conn->s[1], &ev[1])) {
>  		int ret = -errno;
>  		flow_perror(conn, "ERROR on epoll_ctl()");
>  		return ret;
>  	}
> -
> -	conn->in_epoll = true;
> +	conn->f.epollfd = epollfd;
>
>  	return 0;
>  }
>
>  /**
>   * conn_flag_do() - Set/unset given flag, log, update epoll on CLOSING flag
> - * @c: Execution context
>   * @conn: Connection pointer
>   * @flag: Flag to set, or ~flag to unset
>   */
> -static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
> +static void conn_flag_do(struct tcp_splice_conn *conn,
>  			 unsigned long flag)
>  {
>  	if (flag & (flag - 1)) {
> @@ -204,15 +204,15 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
>  	}
>
>  	if (flag == CLOSING) {
> -		epoll_del(c->epollfd, conn->s[0]);
> -		epoll_del(c->epollfd, conn->s[1]);
> +		epoll_del(conn->f.epollfd, conn->s[0]);
> +		epoll_del(conn->f.epollfd, conn->s[1]);
>  	}
>  }
>
>  #define conn_flag(c, conn, flag) \
>  	do { \
>  		flow_trace(conn, "flag at %s:%i", __func__, __LINE__); \
> -		conn_flag_do(c, conn, flag); \
> +		conn_flag_do(conn, flag); \
>  	} while (0)
>
>  /**
> @@ -751,10 +751,9 @@ void tcp_splice_init(struct ctx *c)
>
>  /**
>   * tcp_splice_timer() - Timer for spliced connections
> - * @c: Execution context
>   * @conn: Connection to handle
>   */
> -void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn)
> +void tcp_splice_timer(struct tcp_splice_conn *conn)
>  {
>  	unsigned sidei;
>
> --
> 2.50.1
>

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--9IivLi9pR0qNKJtl
Content-Type: application/pgp-signature; name=signature.asc

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmjkrigACgkQzQJF27ox
2GfUoRAArWi5d6JxuNQvRbsOctEMxkt223Q6Fs7YwHhiTGAFsXaLTYzw4rqy+Yfh
9cm3IqqH54u0YTj2mnqQP4S2K4QA7QnRUVZgOssLdmipEmWsqcVEyLlFmjwolHva
csVpZarZDnO6Awh7CCAaHc2GhRmJmN3mHlsk90vnGtAoxf8KfRLEBuxV8uKa/eFd
pqXNJJRDOZcon7rxH12fae9KQvPvdCnrBDOWL4dIPc/L4eXHHroDhwDpoI6WIRWY
xVjGEWdz+P0xpx5RxFBwJgwP+2q9iAhKbQZiAKobjUHLtpBblVYl28qQFKlTezzs
uyUICI67qzixY5LyHxZOhZiJVV0ExlGjqBqt1UE50L19a1SeaFRSbqdkUhl1P3gY
h9wS84+L+IpMqumLReon4xYn7xSqGhjD5WBNDppOc4r4IEjVaC60f1rk//Rh8gzR
WatfAWlOJPxrNtjGDsQnK/a610I47Tdk175iZ5L6ggKT5n9G5M0bSYUbn3mzcb0M
QTP5jVmswGnlmgntzI2NJQmjtGN81xJfBzjK9hvUr/uozgpnqUSE8SH2D5y6F0FY
zao0bo/n5QYwtf5w6oFDQmY4FGtCtanuCqpO8SWefUgyEfPaGM4GCRZd7W94ZeLK
q05FtT+neoqUigx9E9Icu+srHCnTa3EJOamvWue/O8LhEEKLFlg=
=C7ug
-----END PGP SIGNATURE-----

--9IivLi9pR0qNKJtl--