From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Cc: Asahi Lina <lina@asahilina.net>, Sergio Lopez <slp@redhat.com>,
Jon Maloy <jmaloy@redhat.com>,
Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN
Date: Thu, 16 Jan 2025 01:45:00 +0100 [thread overview]
Message-ID: <20250116014500.7da58696@elisabeth> (raw)
In-Reply-To: <20250114223316.3995265-1-sbrivio@redhat.com>
On Tue, 14 Jan 2025 23:33:16 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:
> Before SO_PEEK_OFF support was introduced by commit e63d281871ef
> ("tcp: leverage support of SO_PEEK_OFF socket option when available"),
> we would peek data from sockets using a "discard" buffer as first
> iovec element, so that, unless we had no pending data at all, we would
> always get a positive return code from recvmsg() (except for closing
> connections or errors).
>
> If we couldn't send more data to the guest, in the window, we would
> set the STALLED flag (causing the epoll descriptor to switch to
> edge-triggered mode), and return early from tcp_data_from_sock().
>
> With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
> on the socket, but nothing beyond our current peeking offset, we'll
> get EAGAIN instead of our current "discard" length. In that case, we
> return even earlier, and we don't set EPOLLET on the socket as a
> result.
>
> As reported by Asahi Lina, this causes event loops where the kernel is
> signalling socket readiness, because there's data we didn't dequeue
> yet (waiting for the guest to acknowledge it), but we won't actually
> peek anything new, and return early without setting EPOLLET.
>
> This is the original report, mentioning the originally proposed fix:
>
> --
> When there is unacknowledged data in the inbound socket buffer, passt
> leaves the socket in the epoll instance to accept new data from the
> server. Since there is already data in the socket buffer, an epoll
> without EPOLLET will repeatedly fire while no data is processed,
> busy-looping the CPU:
>
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>
> Add in the missing EPOLLET flag for this case. This brings CPU
> usage down from around ~80% when downloading over TCP, to ~5% (use
> case: passt as network transport for muvm, downloading Steam games).
> --
>
> we can't set EPOLLET unconditionally though, at least right now,
> because we don't monitor the guest tap for EPOLLOUT in case we fail
> to write on that side because we filled up that buffer (and not the
> window of a TCP connection).
>
> Instead, rely on the observation that we only get EAGAIN on recvmsg()
> if we are attempting to peek data from a socket with a non-zero
> peeking offset: we only peek when there's pending data on a socket,
> and in that case, if we peek without offset, we'll always see some
> data.
>
> And if we peek data with a non-zero offset and get EAGAIN, that means
> that we're either waiting for more data to arrive on the socket (which
> would cause further wake-ups, even with EPOLLET), or we're waiting for
> the guest to acknowledge some of it, which would anyway cause a
> wake-up.
>
> In that case, it's safe to set STALLED and, in turn, EPOLLET on the
> socket, which fixes the EPOLLIN event loop.
>
> Reported-by: Asahi Lina <lina@asahilina.net>
> Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> tcp_buf.c | 1 +
> tcp_vu.c | 2 ++
> 2 files changed, 3 insertions(+)
>
> diff --git a/tcp_buf.c b/tcp_buf.c
> index a975a55..577467b 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -359,6 +359,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> return -errno;
> }
>
> + conn_flag(c, conn, STALLED);
> return 0;
> }
>
> diff --git a/tcp_vu.c b/tcp_vu.c
> index 10e17d3..22ae71d 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -399,6 +399,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> tcp_rst(c, conn);
> return len;
> }
> +
> + conn_flag(c, conn, STALLED);
This occasionally makes the tests hang at:
passt_vu_in_ns/tcp [19/32] - TCP/IPv6: guest to host: big transfer
for the vhost-user case only. Yes, guest to host.
I didn't really check why and how, I thought the vhost-user path here
would pretty much match the other one, but that doesn't seem to be
the case.
Perhaps we don't always wake up if the guest acknowledges something.
> return 0;
> }
>
--
Stefano
prev parent reply other threads:[~2025-01-16 0:45 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-14 22:33 [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
2025-01-16 0:45 ` Stefano Brivio [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250116014500.7da58696@elisabeth \
--to=sbrivio@redhat.com \
--cc=jmaloy@redhat.com \
--cc=lina@asahilina.net \
--cc=lvivier@redhat.com \
--cc=passt-dev@passt.top \
--cc=slp@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).