From: Stefano Brivio <sbrivio@redhat.com>
To: Asahi Lina <lina@asahilina.net>
Cc: passt-dev@passt.top, Sergio Lopez <slp@redhat.com>,
Jon Maloy <jmaloy@redhat.com>
Subject: Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets
Date: Sat, 28 Dec 2024 14:30:44 +0100 [thread overview]
Message-ID: <20241228143044.2119a803@elisabeth> (raw)
In-Reply-To: <20241228-tcp-epollet-fix-v1-1-0ff63e0dff63@asahilina.net>
Hi, thanks for the report and for the patch! A couple of comments:
On Sat, 28 Dec 2024 20:43:15 +0900
Asahi Lina <lina@asahilina.net> wrote:
> When there is unacknowledged data in the inbound socket buffer, passt
> leaves the socket in the epoll instance to accept new data from the
> server. Since there is already data in the socket buffer, an epoll
> without EPOLLET will repeatedly fire while no data is processed,
> busy-looping the CPU:
>
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
Oops. This is unexpected, in the sense that if we have data on the
socket (we leave it there until it's acknowledged by the guest), and we
read all of it, the kernel wakes us up, and tcp_buf_data_from_sock()
should set the (internal) STALLED flag on the socket, which implies
EPOLLET:
if (len <= 0) {
conn_flag(c, conn, STALLED);
return 0;
}
But here, we get -EAGAIN, so we don't touch the socket flags at all,
because we don't expect that the kernel will wake us up again:
if (len < 0) {
if (errno != EAGAIN && errno != EWOULDBLOCK) {
tcp_rst(c, conn);
return -errno;
}
return 0;
}
...and instead it does.
I think it's an unintended consequence of using SO_PEEK_OFF for TCP.
Jon (Cc'ed) added that a while ago, with kernel commit 05ea491641d3
("tcp: add support for SO_PEEK_OFF socket option"), and matching
passt commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF
socket option when available").
Long story short: as we leave unacknowledged data on the socket, we need
a way to fetch new data only, and send it to the guest if we have space
in the window. Before SO_PEEK_OFF for TCP, we used a "discard" buffer
for recvmsg() to skip unacknowledged but "old" data.
With SO_PEEK_OFF, we can directly tell the kernel to skip a given
amount of bytes (the OFFset), and peek "after" that.
But that changes the meaning of EAGAIN, from our perspective. Should
the kernel really wake us up if there's no data after that offset?
By the way, it's also the first time I see this issue, but I suspect
that something like this is what's going on at:
https://github.com/containers/podman/issues/23686#issuecomment-2563142828
...and I just didn't find a way to reproduce it yet.
> Add in the missing EPOLLET flag for this case.
This is probably more robust in any case, and would have the obvious
advantage of fixing your issue right away.
But I'm not entirely sure if we can do that safely, yet. I originally
disabled EPOLLET for TCP sockets in commit f2e3b9defdaa ("tcp: Drop
EPOLLET for non-spliced connections") based on the observation that
receiving functions don't guarantee we read everything from sockets
before they return.
To use EPOLLET safely, we should always loop until we get EAGAIN, and
we don't do that. Now, the codebase changed significantly since then,
and nowadays we can probably do that without much effort, but I need
need to have a thorough look at it.
Another thing to check is whether the epoll behaviour on SO_PEEK_OFF is
correct. Waking up a process just to reply EAGAIN doesn't make a lot of
sense.
We could also consider setting STALLED in that case. Even if we come up
with a kernel fix, we'll need a workaround in passt for older kernels
anyway.
Give me a few days... unless you or somebody else can look into it, of
course.
> This brings CPU
> usage down from around ~80% when downloading over TCP, to ~5% (use
> case: passt as network transport for muvm, downloading Steam games).
Hah, maybe that's the key to reproducing this reliably. My usage of
passt with muvm at the moment is pretty much limited to SSH, DNS and
short "test" transfers. I'll give that a try (large HTTP transfers?).
--
Stefano
next prev parent reply other threads:[~2024-12-28 13:30 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-28 11:43 [PATCH] tcp: Add missing EPOLLET flag for established sockets Asahi Lina
2024-12-28 13:30 ` Stefano Brivio [this message]
2024-12-28 15:31 ` Asahi Lina
2024-12-29 12:25 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241228143044.2119a803@elisabeth \
--to=sbrivio@redhat.com \
--cc=jmaloy@redhat.com \
--cc=lina@asahilina.net \
--cc=passt-dev@passt.top \
--cc=slp@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).