From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202412 header.b=pEWHk8ng; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 7AC135A0272 for ; Fri, 24 Jan 2025 04:38:37 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202412; t=1737689892; bh=O2UWZcjvKfXfcprsH2U6GKfBwKxvtfnX+D6vLw+dZwI=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=pEWHk8ngheWqLLMNtsMUdx8DXVqBxdPUG0Y0oRKbwHjBIhe7cUK0ZDS3dh5WBnzWs VLIJv/k7a/P6NKghWUZUfYRYDY8zPOJ/6VCUiy0du8Y2uD8jB40EtTg7yZUiL6nbds MXb7DgTsgJls3tepxKYxvCeH5UY0L43rJ2dlcPXqSRtoEHx1XHZetZm9cMg6/4929T c5fxqz925W84ZdSQB+aSmrM8HPwf124j/rUXIMX9J11bBz4sCl5L2KspON/oMaG75P umMmQotjldsyy54D4vN1UWVKRRafrGcrNAcTotKYUd0LWw2mZS4bltLMQPVCW9IOQS 65u3XdekVKj5g== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4YfNnJ3qPmz4wp0; Fri, 24 Jan 2025 14:38:12 +1100 (AEDT) Date: Fri, 24 Jan 2025 14:06:04 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest Message-ID: References: <20250116203250.784496-1-sbrivio@redhat.com> <20250116203250.784496-5-sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="vYLf/q3ODOMJLTdQ" Content-Disposition: inline In-Reply-To: <20250116203250.784496-5-sbrivio@redhat.com> Message-ID-Hash: SF4OKUWVPCHOQ7GL6WAX5IXRNJQDDLAP X-Message-ID-Hash: SF4OKUWVPCHOQ7GL6WAX5IXRNJQDDLAP X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Asahi Lina , Sergio Lopez , Jon Maloy X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --vYLf/q3ODOMJLTdQ Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jan 16, 2025 at 09:32:50PM +0100, Stefano Brivio wrote: > There are pretty much two cases of the (misnomer) STALLED: in one > case, we could send more data to the guest if it becomes available, > and in another case, we can't, because we filled the window. >=20 > If, in this second case, we keep EPOLLIN enabled, but never read from > the socket, we get short but CPU-annoying storms of EPOLLIN events, > upon which we reschedule the ACK timeout handler, never read from the > socket, go back to epoll_wait(), and so on: >=20 > timerfd_settime(76, 0, {it_interval=3D{tv_sec=3D0, tv_nsec=3D0}, it_val= ue=3D{tv_sec=3D2, tv_nsec=3D0}}, NULL) =3D 0 > epoll_wait(3, [{events=3DEPOLLIN, data=3D{u32=3D10497, u64=3D3865471616= 1}}], 8, 1000) =3D 1 > timerfd_settime(76, 0, {it_interval=3D{tv_sec=3D0, tv_nsec=3D0}, it_val= ue=3D{tv_sec=3D2, tv_nsec=3D0}}, NULL) =3D 0 > epoll_wait(3, [{events=3DEPOLLIN, data=3D{u32=3D10497, u64=3D3865471616= 1}}], 8, 1000) =3D 1 > timerfd_settime(76, 0, {it_interval=3D{tv_sec=3D0, tv_nsec=3D0}, it_val= ue=3D{tv_sec=3D2, tv_nsec=3D0}}, NULL) =3D 0 > epoll_wait(3, [{events=3DEPOLLIN, data=3D{u32=3D10497, u64=3D3865471616= 1}}], 8, 1000) =3D 1 >=20 > also known as: >=20 > 29.1517: Flow 2 (TCP connection): timer expires in 2.000s > 29.1517: Flow 2 (TCP connection): timer expires in 2.000s > 29.1517: Flow 2 (TCP connection): timer expires in 2.000s >=20 > which, for some reason, becomes very visible with muvm and aria2c > downloading from a server nearby in parallel chunks. >=20 > That's because EPOLLIN isn't cleared if we don't read from the socket, > and even with EPOLLET, epoll_wait() will repeatedly wake us up until > we actually read something. >=20 > In this case, we don't want to subscribe to EPOLLIN at all: all we're > waiting for is an ACK segment from the guest. Differentiate this case > with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just > indicate that we're waiting for an ACK from the guest > (ACK_FROM_TAP_DUE), but also that we're blocked waiting for it. >=20 > If this flag is set before we set STALLED, EPOLLIN will be masked > while we set EPOLLET because of STALLED. Whenever we clear STALLED, > we also clear this flag. >=20 > This is definitely not elegant, but it's a minimal fix. >=20 > We can probably simplify this at a later point by having a category > of connection flags directly corresponding to epoll flags, and > dropping STALLED altogether, or, perhaps, always using EPOLLET (but > we need a mechanism to re-check sockets for pending data if we can't > temporarily write to the guest). >=20 > I suspect that this might also be implied in > https://github.com/containers/podman/issues/23686, hence the Link: > tag. It doesn't necessarily mean I'm fixing it (I can't reproduce > that). >=20 > Link: https://github.com/containers/podman/issues/23686 > Signed-off-by: Stefano Brivio Reviewed-by: David Gibson > --- > tcp.c | 8 ++++++-- > tcp_buf.c | 2 ++ > tcp_conn.h | 1 + > tcp_vu.c | 2 ++ > 4 files changed, 11 insertions(+), 2 deletions(-) >=20 > diff --git a/tcp.c b/tcp.c > index ef33388..3b3193a 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -345,7 +345,7 @@ static const char *tcp_state_str[] __attribute((__unu= sed__)) =3D { > =20 > static const char *tcp_flag_str[] __attribute((__unused__)) =3D { > "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", > - "ACK_FROM_TAP_DUE", > + "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", > }; > =20 > /* Listening sockets, used for automatic port forwarding in pasta mode o= nly */ > @@ -436,8 +436,12 @@ static uint32_t tcp_conn_epoll_events(uint8_t events= , uint8_t conn_flags) > if (events & TAP_FIN_SENT) > return EPOLLET; > =20 > - if (conn_flags & STALLED) > + if (conn_flags & STALLED) { > + if (conn_flags & ACK_FROM_TAP_BLOCKS) > + return EPOLLRDHUP | EPOLLET; > + > return EPOLLIN | EPOLLRDHUP | EPOLLET; > + } > =20 > return EPOLLIN | EPOLLRDHUP; > } > diff --git a/tcp_buf.c b/tcp_buf.c > index 8c15101..cbefa42 100644 > --- a/tcp_buf.c > +++ b/tcp_buf.c > @@ -309,6 +309,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struc= t tcp_tap_conn *conn) > } > =20 > if (!wnd_scaled || already_sent >=3D wnd_scaled) { > + conn_flag(c, conn, ACK_FROM_TAP_BLOCKS); > conn_flag(c, conn, STALLED); > conn_flag(c, conn, ACK_FROM_TAP_DUE); > return 0; > @@ -387,6 +388,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struc= t tcp_tap_conn *conn) > return 0; > } > =20 > + conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS); > conn_flag(c, conn, ~STALLED); > =20 > send_bufs =3D DIV_ROUND_UP(len, mss); > diff --git a/tcp_conn.h b/tcp_conn.h > index 6ae0511..d342680 100644 > --- a/tcp_conn.h > +++ b/tcp_conn.h > @@ -77,6 +77,7 @@ struct tcp_tap_conn { > #define ACTIVE_CLOSE BIT(2) > #define ACK_TO_TAP_DUE BIT(3) > #define ACK_FROM_TAP_DUE BIT(4) > +#define ACK_FROM_TAP_BLOCKS BIT(5) > =20 > #define SNDBUF_BITS 24 > unsigned int sndbuf :SNDBUF_BITS; > diff --git a/tcp_vu.c b/tcp_vu.c > index 8256f53..a216bb1 100644 > --- a/tcp_vu.c > +++ b/tcp_vu.c > @@ -381,6 +381,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct= tcp_tap_conn *conn) > } > =20 > if (!wnd_scaled || already_sent >=3D wnd_scaled) { > + conn_flag(c, conn, ACK_FROM_TAP_BLOCKS); > conn_flag(c, conn, STALLED); > conn_flag(c, conn, ACK_FROM_TAP_DUE); > return 0; > @@ -423,6 +424,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct= tcp_tap_conn *conn) > return 0; > } > =20 > + conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS); > conn_flag(c, conn, ~STALLED); > =20 > /* Likely, some new data was acked too. */ --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --vYLf/q3ODOMJLTdQ Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmeTA5sACgkQzQJF27ox 2Gd2fBAAikmrLzrGGdHYkLrOe4GhjQrHpidztVLbIti/z83bsZtNQuUAfpUrTVnn 9DlmCGsL3JKHHAcG1iO6b8FwNohirsx8mjsKwtvrJ/98UYnFtLAfRiRYbCe/xzzn 11B+EUnjXns2bqgsedzE6mPs6gLSHrYVsu+j5nB7wt/X4g+MTVwNgGPZaluAAw+Z yxVBdXNA3KJ7TdC7vx4QHHtmswyYaTPl8b0tYj81hZZKytU+aUcf+dOIBJbvNAQh izbBRxngRCohT3WMV+Yr7d8qI41dIv6J9W7wjw7EOWF9S8AyPSbBXY0ClYrJx98P ahXUj0qwNwMrzaYGtuWngmsBJiv/tiFE2S2GoSt5UJwWlftA2htW/13DrmEkw052 iNUWV63t25bqkSdenwlsqR4Ta+elRoyaRJB3TGeiEciMFS82kvjgFF4P1gPA33Rc dIYan0Byj9lbcH0nsKSVs6uXxVyE+t54kv7gECCdyxqip5nyrSnJ9eaSIrVZEov2 dhHJH4d0m8RINWlL1qGTyyBv1jItqU4B2MiZT+rdL84TKCc/erV2mNTs2BxGSicy dR5oxEV1QBYXQPydwowSUUMzKX4HCxsyLf/ZZLSte1eHIdoYtX01Y0YWuQW9X9tt Ek40vT0T2E9EwqVlo3AjW/008djommHBjemAOAmbKipVW/osueE= =S2Rm -----END PGP SIGNATURE----- --vYLf/q3ODOMJLTdQ--