[PATCH] tcp: Add missing EPOLLET flag for established sockets

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [PATCH] tcp: Add missing EPOLLET flag for established sockets
@ 2024-12-28 11:43 Asahi Lina
  2024-12-28 13:30 ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: Asahi Lina @ 2024-12-28 11:43 UTC (permalink / raw)
  To: passt-dev; +Cc: Sergio Lopez, Asahi Lina

When there is unacknowledged data in the inbound socket buffer, passt
leaves the socket in the epoll instance to accept new data from the
server. Since there is already data in the socket buffer, an epoll
without EPOLLET will repeatedly fire while no data is processed,
busy-looping the CPU:

epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)

Add in the missing EPOLLET flag for this case. This brings CPU
usage down from around ~80% when downloading over TCP, to ~5% (use
case: passt as network transport for muvm, downloading Steam games).

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index ec433f7d54bcccc4f1ba33e7add10c1e61807bc8..38042264a83570145d7fe9577d9fe1928435ff4d 100644
--- a/tcp.c
+++ b/tcp.c
@@ -439,7 +439,7 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 		if (conn_flags & STALLED)
 			return EPOLLIN | EPOLLOUT | EPOLLRDHUP | EPOLLET;
 
-		return EPOLLIN | EPOLLRDHUP;
+		return EPOLLIN | EPOLLRDHUP | EPOLLET;
 	}
 
 	if (events == TAP_SYN_RCVD)

---
base-commit: e5ba8adef71ec53e192373ed1267dc338719dda0
change-id: 20241228-tcp-epollet-fix-3f8e9c736cd1

Cheers,
~~ Lina


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets
  2024-12-28 11:43 [PATCH] tcp: Add missing EPOLLET flag for established sockets Asahi Lina
@ 2024-12-28 13:30 ` Stefano Brivio
  2024-12-28 15:31   ` Asahi Lina
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2024-12-28 13:30 UTC (permalink / raw)
  To: Asahi Lina; +Cc: passt-dev, Sergio Lopez, Jon Maloy

Hi, thanks for the report and for the patch! A couple of comments:

On Sat, 28 Dec 2024 20:43:15 +0900
Asahi Lina <lina@asahilina.net> wrote:

> When there is unacknowledged data in the inbound socket buffer, passt
> leaves the socket in the epoll instance to accept new data from the
> server. Since there is already data in the socket buffer, an epoll
> without EPOLLET will repeatedly fire while no data is processed,
> busy-looping the CPU:
> 
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)

Oops. This is unexpected, in the sense that if we have data on the
socket (we leave it there until it's acknowledged by the guest), and we
read all of it, the kernel wakes us up, and tcp_buf_data_from_sock()
should set the (internal) STALLED flag on the socket, which implies
EPOLLET:

	if (len <= 0) {
		conn_flag(c, conn, STALLED);
		return 0;
	}

But here, we get -EAGAIN, so we don't touch the socket flags at all,
because we don't expect that the kernel will wake us up again:

	if (len < 0) {
		if (errno != EAGAIN && errno != EWOULDBLOCK) {
			tcp_rst(c, conn);
			return -errno;
		}

		return 0;
	}

...and instead it does.

I think it's an unintended consequence of using SO_PEEK_OFF for TCP.
Jon (Cc'ed) added that a while ago, with kernel commit 05ea491641d3
("tcp: add support for SO_PEEK_OFF socket option"), and matching
passt commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF
socket option when available").

Long story short: as we leave unacknowledged data on the socket, we need
a way to fetch new data only, and send it to the guest if we have space
in the window. Before SO_PEEK_OFF for TCP, we used a "discard" buffer
for recvmsg() to skip unacknowledged but "old" data.

With SO_PEEK_OFF, we can directly tell the kernel to skip a given
amount of bytes (the OFFset), and peek "after" that.

But that changes the meaning of EAGAIN, from our perspective. Should
the kernel really wake us up if there's no data after that offset?

By the way, it's also the first time I see this issue, but I suspect
that something like this is what's going on at:

  https://github.com/containers/podman/issues/23686#issuecomment-2563142828

...and I just didn't find a way to reproduce it yet.

> Add in the missing EPOLLET flag for this case.

This is probably more robust in any case, and would have the obvious
advantage of fixing your issue right away.

But I'm not entirely sure if we can do that safely, yet. I originally
disabled EPOLLET for TCP sockets in commit f2e3b9defdaa ("tcp: Drop
EPOLLET for non-spliced connections") based on the observation that
receiving functions don't guarantee we read everything from sockets
before they return.

To use EPOLLET safely, we should always loop until we get EAGAIN, and
we don't do that. Now, the codebase changed significantly since then,
and nowadays we can probably do that without much effort, but I need
need to have a thorough look at it.

Another thing to check is whether the epoll behaviour on SO_PEEK_OFF is
correct. Waking up a process just to reply EAGAIN doesn't make a lot of
sense.

We could also consider setting STALLED in that case. Even if we come up
with a kernel fix, we'll need a workaround in passt for older kernels
anyway.

Give me a few days... unless you or somebody else can look into it, of
course.

> This brings CPU
> usage down from around ~80% when downloading over TCP, to ~5% (use
> case: passt as network transport for muvm, downloading Steam games).

Hah, maybe that's the key to reproducing this reliably. My usage of
passt with muvm at the moment is pretty much limited to SSH, DNS and
short "test" transfers. I'll give that a try (large HTTP transfers?).

-- 
Stefano

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets
  2024-12-28 13:30 ` Stefano Brivio
@ 2024-12-28 15:31   ` Asahi Lina
  2024-12-29 12:25     ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: Asahi Lina @ 2024-12-28 15:31 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Sergio Lopez, Jon Maloy

Hi,

On 12/28/24 10:30 PM, Stefano Brivio wrote:
> Hi, thanks for the report and for the patch! A couple of comments:
> 
> On Sat, 28 Dec 2024 20:43:15 +0900
> Asahi Lina <lina@asahilina.net> wrote:
> 
>> When there is unacknowledged data in the inbound socket buffer, passt
>> leaves the socket in the epoll instance to accept new data from the
>> server. Since there is already data in the socket buffer, an epoll
>> without EPOLLET will repeatedly fire while no data is processed,
>> busy-looping the CPU:
>>
>> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
>> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
>> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> 
> Oops. This is unexpected, in the sense that if we have data on the
> socket (we leave it there until it's acknowledged by the guest), and we
> read all of it, the kernel wakes us up, and tcp_buf_data_from_sock()
> should set the (internal) STALLED flag on the socket, which implies
> EPOLLET:
> 
> 	if (len <= 0) {
> 		conn_flag(c, conn, STALLED);
> 		return 0;
> 	}
> 
> But here, we get -EAGAIN, so we don't touch the socket flags at all,
> because we don't expect that the kernel will wake us up again:
> 
> 	if (len < 0) {
> 		if (errno != EAGAIN && errno != EWOULDBLOCK) {
> 			tcp_rst(c, conn);
> 			return -errno;
> 		}
> 
> 		return 0;
> 	}
> 
> ...and instead it does.
> 
> I think it's an unintended consequence of using SO_PEEK_OFF for TCP.
> Jon (Cc'ed) added that a while ago, with kernel commit 05ea491641d3
> ("tcp: add support for SO_PEEK_OFF socket option"), and matching
> passt commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF
> socket option when available").
> 
> Long story short: as we leave unacknowledged data on the socket, we need
> a way to fetch new data only, and send it to the guest if we have space
> in the window. Before SO_PEEK_OFF for TCP, we used a "discard" buffer
> for recvmsg() to skip unacknowledged but "old" data.
> 
> With SO_PEEK_OFF, we can directly tell the kernel to skip a given
> amount of bytes (the OFFset), and peek "after" that.
> 
> But that changes the meaning of EAGAIN, from our perspective. Should
> the kernel really wake us up if there's no data after that offset?

Right, I wasn't explicit in the commit message but I realized this was
caused by SO_PEEK_OFF when I found that in the code, since otherwise the
strace log makes no sense (it would be incorrect to wake up a process if
there is no data available for reading at all, so at first I thought it
could be a weirder bug with event processing triggering the wrong fds to
be polled or something like that). It clicked once I realized what passt
is doing with SO_PEEK_OFF.

In this case, there *is* data available for reading, and it would be
returned by a recvmsg call without MSG_PEEK, so I think the kernel
behavior is correct, since epoll has no idea about SO_PEEK_OFF or that
the socket is being polled for peeking after an offset. For
level-triggered mode, epoll should return EPOLLIN as long as there's any
data in the buffer that would be readable by a regular recvmsg() (I'm
pretty sure it's not supposed to "remember" you're using MSG_PEEK or
something like that, it's supposed to be stateless). My conclusion is
that an event polling loop using MSG_PEEK like this must use EPOLLET.
That's why I just added the flag.

EPOLLET isn't stateful in terms of receive syscalls nor knows anything
about MSG_PEEK or anything like that either, it simply guarantees that
when data is added to the receive buffer there will be one subsequent
notification triggered, which I think is what we want (as long as we
always empty out the buffer before polling again).

> By the way, it's also the first time I see this issue, but I suspect
> that something like this is what's going on at:
> 
>   https://github.com/containers/podman/issues/23686#issuecomment-2563142828
> 
> ...and I just didn't find a way to reproduce it yet.

I haven't tried other workloads, but the Steam downloading thing is very
consistent. It's easy to catch in `strace` scrolling around a bit. I
think since `strace` adds latency, it actually minimizes the problem a
bit (giving the guest time to ACK data) and it's likely that when not
being traced like that, passt is actually spending a huge number of
syscalls in the busy-wait condition (which is why CPU usage balloons).
Several people noticed this with passt in the past, this is actually
just the first time I look into it.

(Actually, I was looking to debug a different bug causing networking to
get stuck entirely, but I haven't gotten there yet, I just ran into this
one first.)

> 
>> Add in the missing EPOLLET flag for this case.
> 
> This is probably more robust in any case, and would have the obvious
> advantage of fixing your issue right away.
> 
> But I'm not entirely sure if we can do that safely, yet. I originally
> disabled EPOLLET for TCP sockets in commit f2e3b9defdaa ("tcp: Drop
> EPOLLET for non-spliced connections") based on the observation that
> receiving functions don't guarantee we read everything from sockets
> before they return.
> 
> To use EPOLLET safely, we should always loop until we get EAGAIN, and
> we don't do that. Now, the codebase changed significantly since then,
> and nowadays we can probably do that without much effort, but I need
> need to have a thorough look at it.
> 
> Another thing to check is whether the epoll behaviour on SO_PEEK_OFF is
> correct. Waking up a process just to reply EAGAIN doesn't make a lot of
> sense.

(See above, I think it's correct)

> We could also consider setting STALLED in that case. Even if we come up
> with a kernel fix, we'll need a workaround in passt for older kernels
> anyway.

I was thinking of this, maybe the easiest solution would be to just set
STALLED if we get an EAGAIN *and* the current peek offset is nonzero.

> 
> Give me a few days... unless you or somebody else can look into it, of
> course.
> 
>> This brings CPU
>> usage down from around ~80% when downloading over TCP, to ~5% (use
>> case: passt as network transport for muvm, downloading Steam games).
> 
> Hah, maybe that's the key to reproducing this reliably. My usage of
> passt with muvm at the moment is pretty much limited to SSH, DNS and
> short "test" transfers. I'll give that a try (large HTTP transfers?).

Yeah, I think it's large parallel transfers from a fast CDN (easily
maxes out a gigabit internet connection). Perhaps an "optimized"
downloader like aria2c would work similarly? `aria2c -x4 <URL>` or
something like that.

~~ Lina

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets
  2024-12-28 15:31   ` Asahi Lina
@ 2024-12-29 12:25     ` Stefano Brivio
  2025-01-10 10:28       ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2024-12-29 12:25 UTC (permalink / raw)
  To: Asahi Lina; +Cc: passt-dev, Sergio Lopez, Jon Maloy

On Sun, 29 Dec 2024 00:31:11 +0900
Asahi Lina <lina@asahilina.net> wrote:

> Hi,
> 
> On 12/28/24 10:30 PM, Stefano Brivio wrote:
> > Hi, thanks for the report and for the patch! A couple of comments:
> > 
> > On Sat, 28 Dec 2024 20:43:15 +0900
> > Asahi Lina <lina@asahilina.net> wrote:
> >   
> >> When there is unacknowledged data in the inbound socket buffer, passt
> >> leaves the socket in the epoll instance to accept new data from the
> >> server. Since there is already data in the socket buffer, an epoll
> >> without EPOLLET will repeatedly fire while no data is processed,
> >> busy-looping the CPU:
> >>
> >> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> >> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
> >> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> >> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> >> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> >> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
> >> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
> >> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> >> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> >> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)  
> > 
> > Oops. This is unexpected, in the sense that if we have data on the
> > socket (we leave it there until it's acknowledged by the guest), and we
> > read all of it, the kernel wakes us up, and tcp_buf_data_from_sock()
> > should set the (internal) STALLED flag on the socket, which implies
> > EPOLLET:
> > 
> > 	if (len <= 0) {
> > 		conn_flag(c, conn, STALLED);
> > 		return 0;
> > 	}
> > 
> > But here, we get -EAGAIN, so we don't touch the socket flags at all,
> > because we don't expect that the kernel will wake us up again:
> > 
> > 	if (len < 0) {
> > 		if (errno != EAGAIN && errno != EWOULDBLOCK) {
> > 			tcp_rst(c, conn);
> > 			return -errno;
> > 		}
> > 
> > 		return 0;
> > 	}
> > 
> > ...and instead it does.
> > 
> > I think it's an unintended consequence of using SO_PEEK_OFF for TCP.
> > Jon (Cc'ed) added that a while ago, with kernel commit 05ea491641d3
> > ("tcp: add support for SO_PEEK_OFF socket option"), and matching
> > passt commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF
> > socket option when available").
> > 
> > Long story short: as we leave unacknowledged data on the socket, we need
> > a way to fetch new data only, and send it to the guest if we have space
> > in the window. Before SO_PEEK_OFF for TCP, we used a "discard" buffer
> > for recvmsg() to skip unacknowledged but "old" data.
> > 
> > With SO_PEEK_OFF, we can directly tell the kernel to skip a given
> > amount of bytes (the OFFset), and peek "after" that.
> > 
> > But that changes the meaning of EAGAIN, from our perspective. Should
> > the kernel really wake us up if there's no data after that offset?  
> 
> Right, I wasn't explicit in the commit message but I realized this was
> caused by SO_PEEK_OFF when I found that in the code, since otherwise the
> strace log makes no sense (it would be incorrect to wake up a process if
> there is no data available for reading at all, so at first I thought it
> could be a weirder bug with event processing triggering the wrong fds to
> be polled or something like that). It clicked once I realized what passt
> is doing with SO_PEEK_OFF.
> 
> In this case, there *is* data available for reading, and it would be
> returned by a recvmsg call without MSG_PEEK, so I think the kernel
> behavior is correct, since epoll has no idea about SO_PEEK_OFF or that
> the socket is being polled for peeking after an offset. For
> level-triggered mode, epoll should return EPOLLIN as long as there's any
> data in the buffer that would be readable by a regular recvmsg() (I'm
> pretty sure it's not supposed to "remember" you're using MSG_PEEK or
> something like that, it's supposed to be stateless).

Right, epoll is stateless in terms of the current SO_PEEK_OFF offset,
but I don't remember right now (I haven't tested this recently and I
can't find out from a quick glance at the code) if it keeps track of
data that was *read* but not dequeued. I would need to check.

> My conclusion is
> that an event polling loop using MSG_PEEK like this must use EPOLLET.
> That's why I just added the flag.
> 
> EPOLLET isn't stateful in terms of receive syscalls nor knows anything
> about MSG_PEEK or anything like that either, it simply guarantees that
> when data is added to the receive buffer there will be one subsequent
> notification triggered, which I think is what we want (as long as we
> always empty out the buffer before polling again).

...except that we can't always empty the buffer. This isn't really a
problem though, more on that below.

> > By the way, it's also the first time I see this issue, but I suspect
> > that something like this is what's going on at:
> > 
> >   https://github.com/containers/podman/issues/23686#issuecomment-2563142828
> > 
> > ...and I just didn't find a way to reproduce it yet.  
> 
> I haven't tried other workloads, but the Steam downloading thing is very
> consistent. It's easy to catch in `strace` scrolling around a bit. I
> think since `strace` adds latency, it actually minimizes the problem a
> bit (giving the guest time to ACK data) and it's likely that when not
> being traced like that, passt is actually spending a huge number of
> syscalls in the busy-wait condition (which is why CPU usage balloons).
> Several people noticed this with passt in the past, this is actually
> just the first time I look into it.

It's the first time I see this reported. :(

> (Actually, I was looking to debug a different bug causing networking to
> get stuck entirely, but I haven't gotten there yet, I just ran into this
> one first.)

That sounds bad, let me know if there's some debugging you might need
help with.

> >> Add in the missing EPOLLET flag for this case.  
> > 
> > This is probably more robust in any case, and would have the obvious
> > advantage of fixing your issue right away.
> > 
> > But I'm not entirely sure if we can do that safely, yet. I originally
> > disabled EPOLLET for TCP sockets in commit f2e3b9defdaa ("tcp: Drop
> > EPOLLET for non-spliced connections") based on the observation that
> > receiving functions don't guarantee we read everything from sockets
> > before they return.
> > 
> > To use EPOLLET safely, we should always loop until we get EAGAIN, and
> > we don't do that. Now, the codebase changed significantly since then,
> > and nowadays we can probably do that without much effort, but I need
> > need to have a thorough look at it.
> > 
> > Another thing to check is whether the epoll behaviour on SO_PEEK_OFF is
> > correct. Waking up a process just to reply EAGAIN doesn't make a lot of
> > sense.  
> 
> (See above, I think it's correct)
> 
> > We could also consider setting STALLED in that case. Even if we come up
> > with a kernel fix, we'll need a workaround in passt for older kernels
> > anyway.  
> 
> I was thinking of this, maybe the easiest solution would be to just set
> STALLED if we get an EAGAIN *and* the current peek offset is nonzero.

Actually, I think this is the correct solution (with "peek offset" here
meaning the actual offset at which we peek, not the value we pass to
SO_PEEK_OFF, see below), because it substantially restores the
behaviour that was intended before e63d281871ef ("tcp: leverage support
of SO_PEEK_OFF socket option when available").

That's what STALLED was for: if there's data on the socket, but we
can't read anything more, set EPOLLET. An ACK from the guest will make
us clear EPOLLET and have another look.

Right now, STALLED is not always set when it should. If it's cleared
because of an ACK which doesn't acknowledge all the data pending on the
socket, then it's broken, because of the early return on EAGAIN.

Note that we generally set SO_PEEK_OFF to 0 (and reset it on
retransmissions, that is, whenever we need to "re-read" data), but that
does *not* mean that the offset is zero: it means that the offset is
"what we last peeked" (see also socket(7)).

I know, it sounds terribly confusing, but it makes SO_PEEK_OFF actually
useful (the overhead from system calls would otherwise make it not
worth using).

The actual peeking offset will be zero just after tcp_sock_consume(),
without an intervening EPOLLIN, and in a particular case of
tcp_revert_seq() (retransmission).

I'm not sure if we can ever have EAGAIN from the socket with a zero
peeking offset. If we can, then we need to track that. If we can't,
then what you suggest is equivalent to just setting STALLED if we get
EAGAIN.

By the way, I wonder, if it's not too much effort to check: do you hit
this without SO_PEEK_OFF (return false in tcp_probe_peek_offset_cap())?

> > Give me a few days... unless you or somebody else can look into it, of
> > course.
> >   
> >> This brings CPU
> >> usage down from around ~80% when downloading over TCP, to ~5% (use
> >> case: passt as network transport for muvm, downloading Steam games).  
> > 
> > Hah, maybe that's the key to reproducing this reliably. My usage of
> > passt with muvm at the moment is pretty much limited to SSH, DNS and
> > short "test" transfers. I'll give that a try (large HTTP transfers?).  
> 
> Yeah, I think it's large parallel transfers from a fast CDN (easily
> maxes out a gigabit internet connection). Perhaps an "optimized"
> downloader like aria2c would work similarly? `aria2c -x4 <URL>` or
> something like that.

Perhaps, yes, thanks for the tip.

That's something I do quite frequently with QEMU guests (updating
distribution packages from a mirror very close, large parallel
transfers with iperf3, etc.), but maybe the virtio-net implementation
in libkrun is somehow peculiar in this regard.

Which, by the way, reminds me that now that we have vhost-user support
in passt (zero-copy socket reads/writes directly from/to guest memory),
we should look into enabling that for libkrun as well (not just for
QEMU), but that's another story.

-- 
Stefano


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets
  2024-12-29 12:25     ` Stefano Brivio
@ 2025-01-10 10:28       ` Stefano Brivio
  0 siblings, 0 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-01-10 10:28 UTC (permalink / raw)
  To: Asahi Lina; +Cc: passt-dev, Sergio Lopez, Jon Maloy

Sorry, it took a while (and I could reproduce this with muvm, too):

On Sun, 29 Dec 2024 13:25:30 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Sun, 29 Dec 2024 00:31:11 +0900
> Asahi Lina <lina@asahilina.net> wrote:
> 
> > I was thinking of this, maybe the easiest solution would be to just set
> > STALLED if we get an EAGAIN *and* the current peek offset is nonzero.  
> 
> Actually, I think this is the correct solution (with "peek offset" here
> meaning the actual offset at which we peek, not the value we pass to
> SO_PEEK_OFF, see below), because it substantially restores the
> behaviour that was intended before e63d281871ef ("tcp: leverage support
> of SO_PEEK_OFF socket option when available").
> 
> That's what STALLED was for: if there's data on the socket, but we
> can't read anything more, set EPOLLET. An ACK from the guest will make
> us clear EPOLLET and have another look.
> 
> Right now, STALLED is not always set when it should. If it's cleared
> because of an ACK which doesn't acknowledge all the data pending on the
> socket, then it's broken, because of the early return on EAGAIN.
> 
> Note that we generally set SO_PEEK_OFF to 0 (and reset it on
> retransmissions, that is, whenever we need to "re-read" data), but that
> does *not* mean that the offset is zero: it means that the offset is
> "what we last peeked" (see also socket(7)).
> 
> I know, it sounds terribly confusing, but it makes SO_PEEK_OFF actually
> useful (the overhead from system calls would otherwise make it not
> worth using).
> 
> The actual peeking offset will be zero just after tcp_sock_consume(),
> without an intervening EPOLLIN, and in a particular case of
> tcp_revert_seq() (retransmission).
> 
> I'm not sure if we can ever have EAGAIN from the socket with a zero
> peeking offset. If we can, then we need to track that. If we can't,
> then what you suggest is equivalent to just setting STALLED if we get
> EAGAIN.

So, it looks like there are no paths where we would get EAGAIN with a
zero peeking offset. Setting STALLED in both tcp_buf_data_from_sock()
and tcp_vu_data_from_sock() before returning on EAGAIN actually
implements what you suggested, solves the problem, and doesn't create
new ones (at least in my tests).

About setting EPOLLET unconditionally, we also discussed this with
David Gibson in the latest development call: we could, in theory (and
he even tried in the past), because there are two cases where we return
before dequeuing all the data available from the socket (and we need a
subsequent wake-up even if no further data arrives):

- if we can't queue more data to the guest, in window. But then we can
  always rely on the guest waking us up with an ACK (or the ACK timeout
  handler, which we have)

- if we fail to write to the hypervisor (or tap device, for containers).
  It shouldn't really happen under normal circumstances, but there are
  no guarantees.

  In that case, we *should* set EPOLLOUT on that descriptor, and keep
  track of all the sockets that had pending data, and poll them once
  the guest descriptor is ready for writing, otherwise they would get
  stuck (at least until a timeout occurs, but that might be bad enough
  to break functionality).

  But we don't have this implementation, yet.

For the moment, I think we should set STALLED like you suggested (in
those two functions, before the early return). Do you want to send a
second version of the patch, or should I? I would actually be happy to
add more names to the git log. :)

> By the way, I wonder, if it's not too much effort to check: do you hit
> this without SO_PEEK_OFF (return false in tcp_probe_peek_offset_cap())?
> 
> > > Give me a few days... unless you or somebody else can look into it, of
> > > course.
> > >     
> > >> This brings CPU
> > >> usage down from around ~80% when downloading over TCP, to ~5% (use
> > >> case: passt as network transport for muvm, downloading Steam games).    
> > > 
> > > Hah, maybe that's the key to reproducing this reliably. My usage of
> > > passt with muvm at the moment is pretty much limited to SSH, DNS and
> > > short "test" transfers. I'll give that a try (large HTTP transfers?).    
> > 
> > Yeah, I think it's large parallel transfers from a fast CDN (easily
> > maxes out a gigabit internet connection). Perhaps an "optimized"
> > downloader like aria2c would work similarly? `aria2c -x4 <URL>` or
> > something like that.  
> 
> Perhaps, yes, thanks for the tip.
> 
> That's something I do quite frequently with QEMU guests (updating
> distribution packages from a mirror very close, large parallel
> transfers with iperf3, etc.), but maybe the virtio-net implementation
> in libkrun is somehow peculiar in this regard.

I could reproduce only with muvm and aria2c -x4 as you suggested, with
a long transfer, *not* local but almost: same data centre and about
~300µs RTT. More than that, and I don't see the issue, because the
guest's window is always large enough otherwise (I guess).

But yes, there's a peculiarity with muvm: the MTU is set to just 1500
bytes, whereas we typically have 65520 bytes with QEMU. With user-mode
networking, given that packetisation is (re-)done on the host, anything
less than what the guest interface supports doesn't make sense.

I'm using dhcpcd, but https://github.com/AsahiLinux/muvm/pull/111 makes
no difference (yet) because I forgot to handle the MTU there (passt
advertises 65520). For some reason, dhcpcd is also not setting the MTU.

I'll fix this there (or in a follow-up change). If I manually set 65520
bytes, by the way, I can't reproduce the issue anymore.

-- 
Stefano

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-01-10 10:28 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-28 11:43 [PATCH] tcp: Add missing EPOLLET flag for established sockets Asahi Lina
2024-12-28 13:30 ` Stefano Brivio
2024-12-28 15:31   ` Asahi Lina
2024-12-29 12:25     ` Stefano Brivio
2025-01-10 10:28       ` Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).