From: Asahi Lina <lina@asahilina.net>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Sergio Lopez <slp@redhat.com>,
Jon Maloy <jmaloy@redhat.com>
Subject: Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets
Date: Sun, 29 Dec 2024 00:31:11 +0900 [thread overview]
Message-ID: <fca28b52-2db6-4382-bf7e-33e9f1d65488@asahilina.net> (raw)
In-Reply-To: <20241228143044.2119a803@elisabeth>
Hi,
On 12/28/24 10:30 PM, Stefano Brivio wrote:
> Hi, thanks for the report and for the patch! A couple of comments:
>
> On Sat, 28 Dec 2024 20:43:15 +0900
> Asahi Lina <lina@asahilina.net> wrote:
>
>> When there is unacknowledged data in the inbound socket buffer, passt
>> leaves the socket in the epoll instance to accept new data from the
>> server. Since there is already data in the socket buffer, an epoll
>> without EPOLLET will repeatedly fire while no data is processed,
>> busy-looping the CPU:
>>
>> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
>> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
>> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>
> Oops. This is unexpected, in the sense that if we have data on the
> socket (we leave it there until it's acknowledged by the guest), and we
> read all of it, the kernel wakes us up, and tcp_buf_data_from_sock()
> should set the (internal) STALLED flag on the socket, which implies
> EPOLLET:
>
> if (len <= 0) {
> conn_flag(c, conn, STALLED);
> return 0;
> }
>
> But here, we get -EAGAIN, so we don't touch the socket flags at all,
> because we don't expect that the kernel will wake us up again:
>
> if (len < 0) {
> if (errno != EAGAIN && errno != EWOULDBLOCK) {
> tcp_rst(c, conn);
> return -errno;
> }
>
> return 0;
> }
>
> ...and instead it does.
>
> I think it's an unintended consequence of using SO_PEEK_OFF for TCP.
> Jon (Cc'ed) added that a while ago, with kernel commit 05ea491641d3
> ("tcp: add support for SO_PEEK_OFF socket option"), and matching
> passt commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF
> socket option when available").
>
> Long story short: as we leave unacknowledged data on the socket, we need
> a way to fetch new data only, and send it to the guest if we have space
> in the window. Before SO_PEEK_OFF for TCP, we used a "discard" buffer
> for recvmsg() to skip unacknowledged but "old" data.
>
> With SO_PEEK_OFF, we can directly tell the kernel to skip a given
> amount of bytes (the OFFset), and peek "after" that.
>
> But that changes the meaning of EAGAIN, from our perspective. Should
> the kernel really wake us up if there's no data after that offset?
Right, I wasn't explicit in the commit message but I realized this was
caused by SO_PEEK_OFF when I found that in the code, since otherwise the
strace log makes no sense (it would be incorrect to wake up a process if
there is no data available for reading at all, so at first I thought it
could be a weirder bug with event processing triggering the wrong fds to
be polled or something like that). It clicked once I realized what passt
is doing with SO_PEEK_OFF.
In this case, there *is* data available for reading, and it would be
returned by a recvmsg call without MSG_PEEK, so I think the kernel
behavior is correct, since epoll has no idea about SO_PEEK_OFF or that
the socket is being polled for peeking after an offset. For
level-triggered mode, epoll should return EPOLLIN as long as there's any
data in the buffer that would be readable by a regular recvmsg() (I'm
pretty sure it's not supposed to "remember" you're using MSG_PEEK or
something like that, it's supposed to be stateless). My conclusion is
that an event polling loop using MSG_PEEK like this must use EPOLLET.
That's why I just added the flag.
EPOLLET isn't stateful in terms of receive syscalls nor knows anything
about MSG_PEEK or anything like that either, it simply guarantees that
when data is added to the receive buffer there will be one subsequent
notification triggered, which I think is what we want (as long as we
always empty out the buffer before polling again).
> By the way, it's also the first time I see this issue, but I suspect
> that something like this is what's going on at:
>
> https://github.com/containers/podman/issues/23686#issuecomment-2563142828
>
> ...and I just didn't find a way to reproduce it yet.
I haven't tried other workloads, but the Steam downloading thing is very
consistent. It's easy to catch in `strace` scrolling around a bit. I
think since `strace` adds latency, it actually minimizes the problem a
bit (giving the guest time to ACK data) and it's likely that when not
being traced like that, passt is actually spending a huge number of
syscalls in the busy-wait condition (which is why CPU usage balloons).
Several people noticed this with passt in the past, this is actually
just the first time I look into it.
(Actually, I was looking to debug a different bug causing networking to
get stuck entirely, but I haven't gotten there yet, I just ran into this
one first.)
>
>> Add in the missing EPOLLET flag for this case.
>
> This is probably more robust in any case, and would have the obvious
> advantage of fixing your issue right away.
>
> But I'm not entirely sure if we can do that safely, yet. I originally
> disabled EPOLLET for TCP sockets in commit f2e3b9defdaa ("tcp: Drop
> EPOLLET for non-spliced connections") based on the observation that
> receiving functions don't guarantee we read everything from sockets
> before they return.
>
> To use EPOLLET safely, we should always loop until we get EAGAIN, and
> we don't do that. Now, the codebase changed significantly since then,
> and nowadays we can probably do that without much effort, but I need
> need to have a thorough look at it.
>
> Another thing to check is whether the epoll behaviour on SO_PEEK_OFF is
> correct. Waking up a process just to reply EAGAIN doesn't make a lot of
> sense.
(See above, I think it's correct)
> We could also consider setting STALLED in that case. Even if we come up
> with a kernel fix, we'll need a workaround in passt for older kernels
> anyway.
I was thinking of this, maybe the easiest solution would be to just set
STALLED if we get an EAGAIN *and* the current peek offset is nonzero.
>
> Give me a few days... unless you or somebody else can look into it, of
> course.
>
>> This brings CPU
>> usage down from around ~80% when downloading over TCP, to ~5% (use
>> case: passt as network transport for muvm, downloading Steam games).
>
> Hah, maybe that's the key to reproducing this reliably. My usage of
> passt with muvm at the moment is pretty much limited to SSH, DNS and
> short "test" transfers. I'll give that a try (large HTTP transfers?).
Yeah, I think it's large parallel transfers from a fast CDN (easily
maxes out a gigabit internet connection). Perhaps an "optimized"
downloader like aria2c would work similarly? `aria2c -x4 <URL>` or
something like that.
~~ Lina
next prev parent reply other threads:[~2024-12-28 15:31 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-28 11:43 [PATCH] tcp: Add missing EPOLLET flag for established sockets Asahi Lina
2024-12-28 13:30 ` Stefano Brivio
2024-12-28 15:31 ` Asahi Lina [this message]
2024-12-29 12:25 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=fca28b52-2db6-4382-bf7e-33e9f1d65488@asahilina.net \
--to=lina@asahilina.net \
--cc=jmaloy@redhat.com \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
--cc=slp@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).