public inbox for passt-dev@passt.top
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Asahi Lina <lina@asahilina.net>,
	Sergio Lopez <slp@redhat.com>, Jon Maloy <jmaloy@redhat.com>
Subject: Re: [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest
Date: Fri, 24 Jan 2025 14:06:04 +1100
Message-ID: <Z5MDnBns4lIfk4Ic@zatzit>
In-Reply-To: <20250116203250.784496-5-sbrivio@redhat.com>

On Thu, Jan 16, 2025 at 09:32:50PM +0100, Stefano Brivio wrote:
> There are pretty much two cases of the (misnomer) STALLED: in one
> case, we could send more data to the guest if it becomes available,
> and in another case, we can't, because we filled the window.
> 
> If, in this second case, we keep EPOLLIN enabled, but never read from
> the socket, we get short but CPU-annoying storms of EPOLLIN events,
> upon which we reschedule the ACK timeout handler, never read from the
> socket, go back to epoll_wait(), and so on:
> 
>   timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
>   epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
>   timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
>   epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
>   timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
>   epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
> 
> also known as:
> 
>   29.1517: Flow 2 (TCP connection): timer expires in 2.000s
>   29.1517: Flow 2 (TCP connection): timer expires in 2.000s
>   29.1517: Flow 2 (TCP connection): timer expires in 2.000s
> 
> which, for some reason, becomes very visible with muvm and aria2c
> downloading from a server nearby in parallel chunks.
> 
> That's because EPOLLIN isn't cleared if we don't read from the socket,
> and even with EPOLLET, epoll_wait() will repeatedly wake us up until
> we actually read something.
> 
> In this case, we don't want to subscribe to EPOLLIN at all: all we're
> waiting for is an ACK segment from the guest. Differentiate this case
> with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
> indicate that we're waiting for an ACK from the guest
> (ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.
> 
> If this flag is set before we set STALLED, EPOLLIN will be masked
> while we set EPOLLET because of STALLED. Whenever we clear STALLED,
> we also clear this flag.
> 
> This is definitely not elegant, but it's a minimal fix.
> 
> We can probably simplify this at a later point by having a category
> of connection flags directly corresponding to epoll flags, and
> dropping STALLED altogether, or, perhaps, always using EPOLLET (but
> we need a mechanism to re-check sockets for pending data if we can't
> temporarily write to the guest).
> 
> I suspect that this might also be implicated in
> https://github.com/containers/podman/issues/23686, hence the Link:
> tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
> that).
> 
> Link: https://github.com/containers/podman/issues/23686
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
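
For readers following the thread, here is a minimal standalone sketch of
the mask selection that the tcp.c hunk below introduces. It is an
illustration under assumptions, not the passt code: it covers only the
part of tcp_conn_epoll_events() visible in the hunk, after the
TAP_FIN_SENT check. The helper names sketch_conn_events() and
sketch_self_check() are hypothetical, and the bit position of STALLED is
assumed from the order of tcp_flag_str[].

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

#define BIT(n)			(1U << (n))

/* BIT(0) for STALLED is an assumption based on the tcp_flag_str[] order;
 * ACK_FROM_TAP_DUE and ACK_FROM_TAP_BLOCKS match the tcp_conn.h hunk.
 */
#define STALLED			BIT(0)
#define ACK_FROM_TAP_DUE	BIT(4)
#define ACK_FROM_TAP_BLOCKS	BIT(5)

/* Hypothetical helper: mirrors only the branch shown in the tcp.c hunk */
uint32_t sketch_conn_events(uint8_t conn_flags)
{
	if (conn_flags & STALLED) {
		/* Window is full: only an ACK from the guest can unblock
		 * us, so don't ask for EPOLLIN at all.
		 */
		if (conn_flags & ACK_FROM_TAP_BLOCKS)
			return EPOLLRDHUP | EPOLLET;

		/* Stalled, but more socket data could still be sent */
		return EPOLLIN | EPOLLRDHUP | EPOLLET;
	}

	return EPOLLIN | EPOLLRDHUP;
}

/* Usage example: the three cases described in the commit message */
void sketch_self_check(void)
{
	uint8_t blocked = STALLED | ACK_FROM_TAP_DUE | ACK_FROM_TAP_BLOCKS;

	assert(sketch_conn_events(0) ==
	       (uint32_t)(EPOLLIN | EPOLLRDHUP));
	assert(sketch_conn_events(STALLED) ==
	       (uint32_t)(EPOLLIN | EPOLLRDHUP | EPOLLET));
	assert(sketch_conn_events(blocked) ==
	       (uint32_t)(EPOLLRDHUP | EPOLLET));
}
```

Note that ACK_FROM_TAP_BLOCKS has no effect here unless STALLED is also
set, which is why the patch sets the new flag before STALLED and clears
both together.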

> ---
>  tcp.c      | 8 ++++++--
>  tcp_buf.c  | 2 ++
>  tcp_conn.h | 1 +
>  tcp_vu.c   | 2 ++
>  4 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index ef33388..3b3193a 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -345,7 +345,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
>  
>  static const char *tcp_flag_str[] __attribute((__unused__)) = {
>  	"STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
> -	"ACK_FROM_TAP_DUE",
> +	"ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS",
>  };
>  
>  /* Listening sockets, used for automatic port forwarding in pasta mode only */
> @@ -436,8 +436,12 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
>  		if (events & TAP_FIN_SENT)
>  			return EPOLLET;
>  
> -		if (conn_flags & STALLED)
> +		if (conn_flags & STALLED) {
> +			if (conn_flags & ACK_FROM_TAP_BLOCKS)
> +				return EPOLLRDHUP | EPOLLET;
> +
>  			return EPOLLIN | EPOLLRDHUP | EPOLLET;
> +		}
>  
>  		return EPOLLIN | EPOLLRDHUP;
>  	}
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 8c15101..cbefa42 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -309,6 +309,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  	}
>  
>  	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
>  		conn_flag(c, conn, STALLED);
>  		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  		return 0;
> @@ -387,6 +388,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  		return 0;
>  	}
>  
> +	conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
>  	conn_flag(c, conn, ~STALLED);
>  
>  	send_bufs = DIV_ROUND_UP(len, mss);
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 6ae0511..d342680 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -77,6 +77,7 @@ struct tcp_tap_conn {
>  #define ACTIVE_CLOSE		BIT(2)
>  #define ACK_TO_TAP_DUE		BIT(3)
>  #define ACK_FROM_TAP_DUE	BIT(4)
> +#define ACK_FROM_TAP_BLOCKS	BIT(5)
>  
>  #define SNDBUF_BITS		24
>  	unsigned int	sndbuf		:SNDBUF_BITS;
> diff --git a/tcp_vu.c b/tcp_vu.c
> index 8256f53..a216bb1 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -381,6 +381,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  	}
>  
>  	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, ACK_FROM_TAP_BLOCKS);
>  		conn_flag(c, conn, STALLED);
>  		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  		return 0;
> @@ -423,6 +424,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  		return 0;
>  	}
>  
> +	conn_flag(c, conn, ~ACK_FROM_TAP_BLOCKS);
>  	conn_flag(c, conn, ~STALLED);
>  
>  	/* Likely, some new data was acked too. */

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

Thread overview: 9+ messages
2025-01-16 20:32 [PATCH 0/4] Fixes for EAGAIN/EPOLLIN storm and related issues Stefano Brivio
2025-01-16 20:32 ` [PATCH 1/4] tcp: Fix ACK sequence getting out of sync on EPOLLOUT wake-up Stefano Brivio
2025-01-24  2:57   ` David Gibson
2025-01-16 20:32 ` [PATCH 2/4] tcp: Don't subscribe to EPOLLOUT events on STALLED Stefano Brivio
2025-01-24  2:58   ` David Gibson
2025-01-16 20:32 ` [PATCH 3/4] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
2025-01-24  3:00   ` David Gibson
2025-01-16 20:32 ` [PATCH 4/4] tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest Stefano Brivio
2025-01-24  3:06   ` David Gibson [this message]