* [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN
@ 2025-01-14 22:33 Stefano Brivio
2025-01-16 0:45 ` Stefano Brivio
0 siblings, 1 reply; 2+ messages in thread
From: Stefano Brivio @ 2025-01-14 22:33 UTC (permalink / raw)
To: passt-dev; +Cc: Asahi Lina, Sergio Lopez, Jon Maloy
Before SO_PEEK_OFF support was introduced by commit e63d281871ef
("tcp: leverage support of SO_PEEK_OFF socket option when available"),
we would peek data from sockets using a "discard" buffer as first
iovec element, so that, unless we had no pending data at all, we would
always get a positive return code from recvmsg() (except for closing
connections or errors).
If we couldn't send more data to the guest, in the window, we would
set the STALLED flag (causing the epoll descriptor to switch to
edge-triggered mode), and return early from tcp_data_from_sock().
With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
on the socket, but nothing beyond our current peeking offset, we'll
get EAGAIN instead of our current "discard" length. In that case, we
return even earlier, and we don't set EPOLLET on the socket as a
result.
As reported by Asahi Lina, this causes event loops where the kernel is
signalling socket readiness, because there's data we didn't dequeue
yet (waiting for the guest to acknowledge it), but we won't actually
peek anything new, and return early without setting EPOLLET.
This is the original report, mentioning the originally proposed fix:
--
When there is unacknowledged data in the inbound socket buffer, passt
leaves the socket in the epoll instance to accept new data from the
server. Since there is already data in the socket buffer, an epoll
without EPOLLET will repeatedly fire while no data is processed,
busy-looping the CPU:
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
Add in the missing EPOLLET flag for this case. This brings CPU
usage down from around ~80% when downloading over TCP, to ~5% (use
case: passt as network transport for muvm, downloading Steam games).
--
we can't set EPOLLET unconditionally though, at least right now,
because we don't monitor the guest tap for EPOLLOUT in case we fail
to write on that side because we filled up that buffer (and not the
window of a TCP connection).
Instead, rely on the observation that we only get EAGAIN on recvmsg()
if we are attempting to peek data from a socket with a non-zero
peeking offset: we only peek when there's pending data on a socket,
and in that case, if we peek without offset, we'll always see some
data.
And if we peek data with a non-zero offset and get EAGAIN, that means
that we're either waiting for more data to arrive on the socket (which
would cause further wake-ups, even with EPOLLET), or we're waiting for
the guest to acknowledge some of it, which would anyway cause a
wake-up.
In that case, it's safe to set STALLED and, in turn, EPOLLET on the
socket, which fixes the EPOLLIN event loop.
Reported-by: Asahi Lina <lina@asahilina.net>
Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp_buf.c | 1 +
tcp_vu.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/tcp_buf.c b/tcp_buf.c
index a975a55..577467b 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -359,6 +359,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
return -errno;
}
+ conn_flag(c, conn, STALLED);
return 0;
}
diff --git a/tcp_vu.c b/tcp_vu.c
index 10e17d3..22ae71d 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -399,6 +399,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_rst(c, conn);
return len;
}
+
+ conn_flag(c, conn, STALLED);
return 0;
}
--
@@ -399,6 +399,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_rst(c, conn);
return len;
}
+
+ conn_flag(c, conn, STALLED);
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN
2025-01-14 22:33 [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
@ 2025-01-16 0:45 ` Stefano Brivio
0 siblings, 0 replies; 2+ messages in thread
From: Stefano Brivio @ 2025-01-16 0:45 UTC (permalink / raw)
To: passt-dev; +Cc: Asahi Lina, Sergio Lopez, Jon Maloy, Laurent Vivier
On Tue, 14 Jan 2025 23:33:16 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:
> Before SO_PEEK_OFF support was introduced by commit e63d281871ef
> ("tcp: leverage support of SO_PEEK_OFF socket option when available"),
> we would peek data from sockets using a "discard" buffer as first
> iovec element, so that, unless we had no pending data at all, we would
> always get a positive return code from recvmsg() (except for closing
> connections or errors).
>
> If we couldn't send more data to the guest, in the window, we would
> set the STALLED flag (causing the epoll descriptor to switch to
> edge-triggered mode), and return early from tcp_data_from_sock().
>
> With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
> on the socket, but nothing beyond our current peeking offset, we'll
> get EAGAIN instead of our current "discard" length. In that case, we
> return even earlier, and we don't set EPOLLET on the socket as a
> result.
>
> As reported by Asahi Lina, this causes event loops where the kernel is
> signalling socket readiness, because there's data we didn't dequeue
> yet (waiting for the guest to acknowledge it), but we won't actually
> peek anything new, and return early without setting EPOLLET.
>
> This is the original report, mentioning the originally proposed fix:
>
> --
> When there is unacknowledged data in the inbound socket buffer, passt
> leaves the socket in the epoll instance to accept new data from the
> server. Since there is already data in the socket buffer, an epoll
> without EPOLLET will repeatedly fire while no data is processed,
> busy-looping the CPU:
>
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>
> Add in the missing EPOLLET flag for this case. This brings CPU
> usage down from around ~80% when downloading over TCP, to ~5% (use
> case: passt as network transport for muvm, downloading Steam games).
> --
>
> we can't set EPOLLET unconditionally though, at least right now,
> because we don't monitor the guest tap for EPOLLOUT in case we fail
> to write on that side because we filled up that buffer (and not the
> window of a TCP connection).
>
> Instead, rely on the observation that we only get EAGAIN on recvmsg()
> if we are attempting to peek data from a socket with a non-zero
> peeking offset: we only peek when there's pending data on a socket,
> and in that case, if we peek without offset, we'll always see some
> data.
>
> And if we peek data with a non-zero offset and get EAGAIN, that means
> that we're either waiting for more data to arrive on the socket (which
> would cause further wake-ups, even with EPOLLET), or we're waiting for
> the guest to acknowledge some of it, which would anyway cause a
> wake-up.
>
> In that case, it's safe to set STALLED and, in turn, EPOLLET on the
> socket, which fixes the EPOLLIN event loop.
>
> Reported-by: Asahi Lina <lina@asahilina.net>
> Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> tcp_buf.c | 1 +
> tcp_vu.c | 2 ++
> 2 files changed, 3 insertions(+)
>
> diff --git a/tcp_buf.c b/tcp_buf.c
> index a975a55..577467b 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -359,6 +359,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> return -errno;
> }
>
> + conn_flag(c, conn, STALLED);
> return 0;
> }
>
> diff --git a/tcp_vu.c b/tcp_vu.c
> index 10e17d3..22ae71d 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -399,6 +399,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> tcp_rst(c, conn);
> return len;
> }
> +
> + conn_flag(c, conn, STALLED);
This occasionally makes the tests hang at:
passt_vu_in_ns/tcp [19/32] - TCP/IPv6: guest to host: big transfer
for the vhost-user case only. Yes, guest to host.
I didn't really check why and how, I thought the vhost-user path here
would pretty much match the other one, but that doesn't seem to be
the case.
Perhaps we don't always wake up if the guest acknowledges something.
> return 0;
> }
>
--
Stefano
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2025-01-16 0:45 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-14 22:33 [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Stefano Brivio
2025-01-16 0:45 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).