From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=asahilina.net
Authentication-Results: passt.top;
	dkim=pass (2048-bit key; secure) header.d=asahilina.net header.i=@asahilina.net header.a=rsa-sha256 header.s=default header.b=trOlS+5F;
	dkim-atps=neutral
Received: from mail.marcansoft.com (marcansoft.com [212.63.210.85])
	by passt.top (Postfix) with ESMTPS id 12B165A004E
	for <passt-dev@passt.top>; Sat, 28 Dec 2024 16:31:15 +0100 (CET)
Received: from [127.0.0.1] (localhost [127.0.0.1])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(No client certificate requested)
	(Authenticated sender: lina@asahilina.net)
	by mail.marcansoft.com (Postfix) with ESMTPSA id 6DEE541DF4;
	Sat, 28 Dec 2024 15:31:13 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=asahilina.net;
	s=default; t=1735399873;
	bh=N7bjd6wH+jA5KCof1kKNNA2vQ2LfjRHtZuvYwGox7/o=;
	h=Date:Subject:To:Cc:References:From:In-Reply-To;
	b=trOlS+5F45BphMZXyjcikwOwjHTwdosdN6hlv/vuCn85zxke2uxjDjm7pOt+vwxQJ
	 L9TxaXANJReriVJKwZXqULmGhRiOQD+3bSuKYLQ1mZgBhqSUrfyZA7q+hD4Zgsx/92
	 xHYBrMSkajhLslgYTI47EIjA3CwjzO1dst3KCKQ57Tu3OiWO+JCa9YlKcqJ2bhK2rb
	 XUEsSDbFfmaE7JhGAzX8RHpPyDh1OaLP9Oj2/oJmDm8ZyiL8P0y6hfL7PMI+vwFsed
	 yS5tI17DjHHgyprTnERb8WgWEK8E3oiKxVCy9CUeIiBOyXNobGmU44lhR8Sw/aGrvD
	 4SNNj8GdQ0leQ==
Message-ID: <fca28b52-2db6-4382-bf7e-33e9f1d65488@asahilina.net>
Date: Sun, 29 Dec 2024 00:31:11 +0900
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets
To: Stefano Brivio <sbrivio@redhat.com>
References: <20241228-tcp-epollet-fix-v1-1-0ff63e0dff63@asahilina.net>
 <20241228143044.2119a803@elisabeth>
Content-Language: en-US
From: Asahi Lina <lina@asahilina.net>
In-Reply-To: <20241228143044.2119a803@elisabeth>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Message-ID-Hash: AH2UCHXX5LU63BVPU5RC4VZRMQO3WYVO
X-Message-ID-Hash: AH2UCHXX5LU63BVPU5RC4VZRMQO3WYVO
X-MailFrom: lina@asahilina.net
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top, Sergio Lopez <slp@redhat.com>, Jon Maloy <jmaloy@redhat.com>
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/fca28b52-2db6-4382-bf7e-33e9f1d65488@asahilina.net/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/AH2UCHXX5LU63BVPU5RC4VZRMQO3WYVO/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>

Hi,

On 12/28/24 10:30 PM, Stefano Brivio wrote:
> Hi, thanks for the report and for the patch! A couple of comments:
> 
> On Sat, 28 Dec 2024 20:43:15 +0900
> Asahi Lina <lina@asahilina.net> wrote:
> 
>> When there is unacknowledged data in the inbound socket buffer, passt
>> leaves the socket in the epoll instance to accept new data from the
>> server. Since there is already data in the socket buffer, an epoll
>> without EPOLLET will repeatedly fire while no data is processed,
>> busy-looping the CPU:
>>
>> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
>> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
>> recvmsg(25, {msg_namelen=0}, MSG_PEEK)  = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
>> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
> 
> Oops. This is unexpected, in the sense that if we have data on the
> socket (we leave it there until it's acknowledged by the guest), and we
> read all of it, the kernel wakes us up, and tcp_buf_data_from_sock()
> should set the (internal) STALLED flag on the socket, which implies
> EPOLLET:
> 
> 	if (len <= 0) {
> 		conn_flag(c, conn, STALLED);
> 		return 0;
> 	}
> 
> But here, we get -EAGAIN, so we don't touch the socket flags at all,
> because we don't expect that the kernel will wake us up again:
> 
> 	if (len < 0) {
> 		if (errno != EAGAIN && errno != EWOULDBLOCK) {
> 			tcp_rst(c, conn);
> 			return -errno;
> 		}
> 
> 		return 0;
> 	}
> 
> ...and instead it does.
> 
> I think it's an unintended consequence of using SO_PEEK_OFF for TCP.
> Jon (Cc'ed) added that a while ago, with kernel commit 05ea491641d3
> ("tcp: add support for SO_PEEK_OFF socket option"), and matching
> passt commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF
> socket option when available").
> 
> Long story short: as we leave unacknowledged data on the socket, we need
> a way to fetch new data only, and send it to the guest if we have space
> in the window. Before SO_PEEK_OFF for TCP, we used a "discard" buffer
> for recvmsg() to skip unacknowledged but "old" data.
> 
> With SO_PEEK_OFF, we can directly tell the kernel to skip a given
> amount of bytes (the OFFset), and peek "after" that.
> 
> But that changes the meaning of EAGAIN, from our perspective. Should
> the kernel really wake us up if there's no data after that offset?

Right, I wasn't explicit in the commit message but I realized this was
caused by SO_PEEK_OFF when I found that in the code, since otherwise the
strace log makes no sense (it would be incorrect to wake up a process if
there is no data available for reading at all, so at first I thought it
could be a weirder bug with event processing triggering the wrong fds to
be polled or something like that). It clicked once I realized what passt
is doing with SO_PEEK_OFF.

In this case, there *is* data available for reading, and it would be
returned by a recvmsg call without MSG_PEEK, so I think the kernel
behavior is correct, since epoll has no idea about SO_PEEK_OFF or that
the socket is being polled for peeking after an offset. For
level-triggered mode, epoll should return EPOLLIN as long as there's any
data in the buffer that would be readable by a regular recvmsg() (I'm
pretty sure it's not supposed to "remember" you're using MSG_PEEK or
something like that, it's supposed to be stateless). My conclusion is
that an event polling loop using MSG_PEEK like this must use EPOLLET.
That's why I just added the flag.

EPOLLET isn't stateful in terms of receive syscalls nor knows anything
about MSG_PEEK or anything like that either, it simply guarantees that
when data is added to the receive buffer there will be one subsequent
notification triggered, which I think is what we want (as long as we
always empty out the buffer before polling again).

> By the way, it's also the first time I see this issue, but I suspect
> that something like this is what's going on at:
> 
>   https://github.com/containers/podman/issues/23686#issuecomment-2563142828
> 
> ...and I just didn't find a way to reproduce it yet.

I haven't tried other workloads, but the Steam downloading thing is very
consistent. It's easy to catch in `strace` scrolling around a bit. I
think since `strace` adds latency, it actually minimizes the problem a
bit (giving the guest time to ACK data) and it's likely that when not
being traced like that, passt is actually spending a huge number of
syscalls in the busy-wait condition (which is why CPU usage balloons).
Several people noticed this with passt in the past, this is actually
just the first time I look into it.

(Actually, I was looking to debug a different bug causing networking to
get stuck entirely, but I haven't gotten there yet, I just ran into this
one first.)

> 
>> Add in the missing EPOLLET flag for this case.
> 
> This is probably more robust in any case, and would have the obvious
> advantage of fixing your issue right away.
> 
> But I'm not entirely sure if we can do that safely, yet. I originally
> disabled EPOLLET for TCP sockets in commit f2e3b9defdaa ("tcp: Drop
> EPOLLET for non-spliced connections") based on the observation that
> receiving functions don't guarantee we read everything from sockets
> before they return.
> 
> To use EPOLLET safely, we should always loop until we get EAGAIN, and
> we don't do that. Now, the codebase changed significantly since then,
> and nowadays we can probably do that without much effort, but I need
> need to have a thorough look at it.
> 
> Another thing to check is whether the epoll behaviour on SO_PEEK_OFF is
> correct. Waking up a process just to reply EAGAIN doesn't make a lot of
> sense.

(See above, I think it's correct)

> We could also consider setting STALLED in that case. Even if we come up
> with a kernel fix, we'll need a workaround in passt for older kernels
> anyway.

I was thinking of this, maybe the easiest solution would be to just set
STALLED if we get an EAGAIN *and* the current peek offset is nonzero.

> 
> Give me a few days... unless you or somebody else can look into it, of
> course.
> 
>> This brings CPU
>> usage down from around ~80% when downloading over TCP, to ~5% (use
>> case: passt as network transport for muvm, downloading Steam games).
> 
> Hah, maybe that's the key to reproducing this reliably. My usage of
> passt with muvm at the moment is pretty much limited to SSH, DNS and
> short "test" transfers. I'll give that a try (large HTTP transfers?).

Yeah, I think it's large parallel transfers from a fast CDN (easily
maxes out a gigabit internet connection). Perhaps an "optimized"
downloader like aria2c would work similarly? `aria2c -x4 <URL>` or
something like that.

~~ Lina