public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Cc: Asahi Lina <lina@asahilina.net>, Sergio Lopez <slp@redhat.com>,
	Jon Maloy <jmaloy@redhat.com>,
	Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN
Date: Thu, 16 Jan 2025 01:45:00 +0100	[thread overview]
Message-ID: <20250116014500.7da58696@elisabeth> (raw)
In-Reply-To: <20250114223316.3995265-1-sbrivio@redhat.com>

On Tue, 14 Jan 2025 23:33:16 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> Before SO_PEEK_OFF support was introduced by commit e63d281871ef
> ("tcp: leverage support of SO_PEEK_OFF socket option when available"),
> we would peek data from sockets using a "discard" buffer as first
> iovec element, so that, unless we had no pending data at all, we would
> always get a positive return code from recvmsg() (except for closing
> connections or errors).
> 
> If we couldn't send more data to the guest, in the window, we would
> set the STALLED flag (causing the epoll descriptor to switch to
> edge-triggered mode), and return early from tcp_data_from_sock().
> 
> With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
> on the socket, but nothing beyond our current peeking offset, we'll
> get EAGAIN instead of our current "discard" length. In that case, we
> return even earlier, and we don't set EPOLLET on the socket as a
> result.
> 
> As reported by Asahi Lina, this causes event loops where the kernel is
> signalling socket readiness, because there's data we didn't dequeue
> yet (waiting for the guest to acknowledge it), but we won't actually
> peek anything new, and return early without setting EPOLLET.
> 
> This is the original report, mentioning the originally proposed fix:
> 
> --
> When there is unacknowledged data in the inbound socket buffer, passt
> leaves the socket in the epoll instance to accept new data from the
> server. Since there is already data in the socket buffer, an epoll
> without EPOLLET will repeatedly fire while no data is processed,
> busy-looping the CPU:
> 
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> 
> Add in the missing EPOLLET flag for this case. This brings CPU
> usage down from around ~80% when downloading over TCP, to ~5% (use
> case: passt as network transport for muvm, downloading Steam games).
> --
> 
> we can't set EPOLLET unconditionally though, at least right now,
> because we don't monitor the guest tap for EPOLLOUT in case we fail
> to write on that side because we filled up that buffer (and not the
> window of a TCP connection).
> 
> Instead, rely on the observation that we only get EAGAIN on recvmsg()
> if we are attempting to peek data from a socket with a non-zero
> peeking offset: we only peek when there's pending data on a socket,
> and in that case, if we peek without offset, we'll always see some
> data.
> 
> And if we peek data with a non-zero offset and get EAGAIN, that means
> that we're either waiting for more data to arrive on the socket (which
> would cause further wake-ups, even with EPOLLET), or we're waiting for
> the guest to acknowledge some of it, which would anyway cause a
> wake-up.
> 
> In that case, it's safe to set STALLED and, in turn, EPOLLET on the
> socket, which fixes the EPOLLIN event loop.
> 
> Reported-by: Asahi Lina <lina@asahilina.net>
> Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp_buf.c | 1 +
>  tcp_vu.c  | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/tcp_buf.c b/tcp_buf.c
> index a975a55..577467b 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -359,6 +359,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  			return -errno;
>  		}
>  
> +		conn_flag(c, conn, STALLED);
>  		return 0;
>  	}
>  
> diff --git a/tcp_vu.c b/tcp_vu.c
> index 10e17d3..22ae71d 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -399,6 +399,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  			tcp_rst(c, conn);
>  			return len;
>  		}
> +
> +		conn_flag(c, conn, STALLED);

This occasionally makes the tests hang at:

  passt_vu_in_ns/tcp [19/32] - TCP/IPv6: guest to host: big transfer

for the vhost-user case only. Yes, guest to host.

I didn't really check why and how, I thought the vhost-user path here
would pretty much match the other one, but that doesn't seem to be
the case.

Perhaps we don't always wake up if the guest acknowledges something.

>  		return 0;
>  	}
>  

-- 
Stefano


      reply	other threads:[~2025-01-16  0:45 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-14 22:33 [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
2025-01-16  0:45 ` Stefano Brivio [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250116014500.7da58696@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=jmaloy@redhat.com \
    --cc=lina@asahilina.net \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=slp@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).