From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Asahi Lina <lina@asahilina.net>,
Sergio Lopez <slp@redhat.com>, Jon Maloy <jmaloy@redhat.com>
Subject: Re: [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest
Date: Fri, 24 Jan 2025 14:06:04 +1100 [thread overview]
Message-ID: <Z5MDnBns4lIfk4Ic@zatzit> (raw)
In-Reply-To: <20250116203250.784496-5-sbrivio@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 5888 bytes --]
On Thu, Jan 16, 2025 at 09:32:50PM +0100, Stefano Brivio wrote:
> There are pretty much two cases of the (misnomer) STALLED: in one
> case, we could send more data to the guest if it becomes available,
> and in another case, we can't, because we filled the window.
>
> If, in this second case, we keep EPOLLIN enabled, but never read from
> the socket, we get short but CPU-annoying storms of EPOLLIN events,
> upon which we reschedule the ACK timeout handler, never read from the
> socket, go back to epoll_wait(), and so on:
>
> timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
> epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
> timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
> epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
> timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
> epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
>
> also known as:
>
> 29.1517: Flow 2 (TCP connection): timer expires in 2.000s
> 29.1517: Flow 2 (TCP connection): timer expires in 2.000s
> 29.1517: Flow 2 (TCP connection): timer expires in 2.000s
>
> which, for some reason, becomes very visible with muvm and aria2c
> downloading from a server nearby in parallel chunks.
>
> That's because EPOLLIN isn't cleared if we don't read from the socket,
> and even with EPOLLET, epoll_wait() will repeatedly wake us up until
> we actually read something.
>
> In this case, we don't want to subscribe to EPOLLIN at all: all we're
> waiting for is an ACK segment from the guest. Differentiate this case
> with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
> indicate that we're waiting for an ACK from the guest
> (ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.
>
> If this flag is set before we set STALLED, EPOLLIN will be masked
> while we set EPOLLET because of STALLED. Whenever we clear STALLED,
> we also clear this flag.
>
> This is definitely not elegant, but it's a minimal fix.
>
> We can probably simplify this at a later point by having a category
> of connection flags directly corresponding to epoll flags, and
> dropping STALLED altogether, or, perhaps, always using EPOLLET (but
> we need a mechanism to re-check sockets for pending data if we can't
> temporarily write to the guest).
>
> I suspect that this might also be implied in
> https://github.com/containers/podman/issues/23686, hence the Link:
> tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
> that).
>
> Link: https://github.com/containers/podman/issues/23686
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 8 ++++++--
> tcp_buf.c | 2 ++
> tcp_conn.h | 1 +
> tcp_vu.c | 2 ++
> 4 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index ef33388..3b3193a 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -345,7 +345,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
>
> static const char *tcp_flag_str[] __attribute((__unused__)) = {
> "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
> - "ACK_FROM_TAP_DUE",
> + "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS",
> };
>
> /* Listening sockets, used for automatic port forwarding in pasta mode only */
> @@ -436,8 +436,12 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
> if (events & TAP_FIN_SENT)
> return EPOLLET;
>
> - if (conn_flags & STALLED)
> + if (conn_flags & STALLED) {
> + if (conn_flags & ACK_FROM_TAP_BLOCKS)
> + return EPOLLRDHUP | EPOLLET;
> +
> return EPOLLIN | EPOLLRDHUP | EPOLLET;
> + }
>
> return EPOLLIN | EPOLLRDHUP;
> }
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 8c15101..cbefa42 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -309,6 +309,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> }
>
> if (!wnd_scaled || already_sent >= wnd_scaled) {
> + conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
> conn_flag(c, conn, STALLED);
> conn_flag(c, conn, ACK_FROM_TAP_DUE);
> return 0;
> @@ -387,6 +388,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> return 0;
> }
>
> + conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
> conn_flag(c, conn, ~STALLED);
>
> send_bufs = DIV_ROUND_UP(len, mss);
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 6ae0511..d342680 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -77,6 +77,7 @@ struct tcp_tap_conn {
> #define ACTIVE_CLOSE BIT(2)
> #define ACK_TO_TAP_DUE BIT(3)
> #define ACK_FROM_TAP_DUE BIT(4)
> +#define ACK_FROM_TAP_BLOCKS BIT(5)
>
> #define SNDBUF_BITS 24
> unsigned int sndbuf :SNDBUF_BITS;
> diff --git a/tcp_vu.c b/tcp_vu.c
> index 8256f53..a216bb1 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -381,6 +381,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> }
>
> if (!wnd_scaled || already_sent >= wnd_scaled) {
> + conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
> conn_flag(c, conn, STALLED);
> conn_flag(c, conn, ACK_FROM_TAP_DUE);
> return 0;
> @@ -423,6 +424,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> return 0;
> }
>
> + conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
> conn_flag(c, conn, ~STALLED);
>
> /* Likely, some new data was acked too. */
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
prev parent reply other threads:[~2025-01-24 3:38 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-16 20:32 [PATCH 0/4] Fixes for EAGAIN/EPOLLIN storm and related issues Stefano Brivio
2025-01-16 20:32 ` [PATCH 1/4] tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up Stefano Brivio
2025-01-24 2:57 ` David Gibson
2025-01-16 20:32 ` [PATCH 2/4] tcp: Don't subscribe to EPOLLOUT events on STALLED Stefano Brivio
2025-01-24 2:58 ` David Gibson
2025-01-16 20:32 ` [PATCH 3/4] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
2025-01-24 3:00 ` David Gibson
2025-01-16 20:32 ` [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest Stefano Brivio
2025-01-24 3:06 ` David Gibson [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z5MDnBns4lIfk4Ic@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=jmaloy@redhat.com \
--cc=lina@asahilina.net \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
--cc=slp@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).