From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=asahilina.net Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=asahilina.net header.i=@asahilina.net header.a=rsa-sha256 header.s=default header.b=trOlS+5F; dkim-atps=neutral Received: from mail.marcansoft.com (marcansoft.com [212.63.210.85]) by passt.top (Postfix) with ESMTPS id 12B165A004E for ; Sat, 28 Dec 2024 16:31:15 +0100 (CET) Received: from [127.0.0.1] (localhost [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: lina@asahilina.net) by mail.marcansoft.com (Postfix) with ESMTPSA id 6DEE541DF4; Sat, 28 Dec 2024 15:31:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=asahilina.net; s=default; t=1735399873; bh=N7bjd6wH+jA5KCof1kKNNA2vQ2LfjRHtZuvYwGox7/o=; h=Date:Subject:To:Cc:References:From:In-Reply-To; b=trOlS+5F45BphMZXyjcikwOwjHTwdosdN6hlv/vuCn85zxke2uxjDjm7pOt+vwxQJ L9TxaXANJReriVJKwZXqULmGhRiOQD+3bSuKYLQ1mZgBhqSUrfyZA7q+hD4Zgsx/92 xHYBrMSkajhLslgYTI47EIjA3CwjzO1dst3KCKQ57Tu3OiWO+JCa9YlKcqJ2bhK2rb XUEsSDbFfmaE7JhGAzX8RHpPyDh1OaLP9Oj2/oJmDm8ZyiL8P0y6hfL7PMI+vwFsed yS5tI17DjHHgyprTnERb8WgWEK8E3oiKxVCy9CUeIiBOyXNobGmU44lhR8Sw/aGrvD 4SNNj8GdQ0leQ== Message-ID: Date: Sun, 29 Dec 2024 00:31:11 +0900 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets To: Stefano Brivio References: <20241228-tcp-epollet-fix-v1-1-0ff63e0dff63@asahilina.net> <20241228143044.2119a803@elisabeth> Content-Language: en-US From: Asahi Lina In-Reply-To: <20241228143044.2119a803@elisabeth> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Message-ID-Hash: AH2UCHXX5LU63BVPU5RC4VZRMQO3WYVO X-Message-ID-Hash: AH2UCHXX5LU63BVPU5RC4VZRMQO3WYVO X-MailFrom: lina@asahilina.net X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Sergio Lopez , Jon Maloy X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: Hi, On 12/28/24 10:30 PM, Stefano Brivio wrote: > Hi, thanks for the report and for the patch! A couple of comments: > > On Sat, 28 Dec 2024 20:43:15 +0900 > Asahi Lina wrote: > >> When there is unacknowledged data in the inbound socket buffer, passt >> leaves the socket in the epoll instance to accept new data from the >> server. Since there is already data in the socket buffer, an epoll >> without EPOLLET will repeatedly fire while no data is processed, >> busy-looping the CPU: >> >> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 >> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) >> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) >> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) >> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) >> epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 >> recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) >> recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) >> recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) >> recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > > Oops. This is unexpected, in the sense that if we have data on the > socket (we leave it there until it's acknowledged by the guest), and we > read all of it, the kernel wakes us up, and tcp_buf_data_from_sock() > should set the (internal) STALLED flag on the socket, which implies > EPOLLET: > > if (len <= 0) { > conn_flag(c, conn, STALLED); > return 0; > } > > But here, we get -EAGAIN, so we don't touch the socket flags at all, > because we don't expect that the kernel will wake us up again: > > if (len < 0) { > if (errno != EAGAIN && errno != EWOULDBLOCK) { > tcp_rst(c, conn); > return -errno; > } > > return 0; > } > > ...and instead it does. > > I think it's an unintended consequence of using SO_PEEK_OFF for TCP. > Jon (Cc'ed) added that a while ago, with kernel commit 05ea491641d3 > ("tcp: add support for SO_PEEK_OFF socket option"), and matching > passt commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF > socket option when available"). > > Long story short: as we leave unacknowledged data on the socket, we need > a way to fetch new data only, and send it to the guest if we have space > in the window. Before SO_PEEK_OFF for TCP, we used a "discard" buffer > for recvmsg() to skip unacknowledged but "old" data. > > With SO_PEEK_OFF, we can directly tell the kernel to skip a given > amount of bytes (the OFFset), and peek "after" that. > > But that changes the meaning of EAGAIN, from our perspective. Should > the kernel really wake us up if there's no data after that offset? Right, I wasn't explicit in the commit message but I realized this was caused by SO_PEEK_OFF when I found that in the code, since otherwise the strace log makes no sense (it would be incorrect to wake up a process if there is no data available for reading at all, so at first I thought it could be a weirder bug with event processing triggering the wrong fds to be polled or something like that). It clicked once I realized what passt is doing with SO_PEEK_OFF. In this case, there *is* data available for reading, and it would be returned by a recvmsg call without MSG_PEEK, so I think the kernel behavior is correct, since epoll has no idea about SO_PEEK_OFF or that the socket is being polled for peeking after an offset. For level-triggered mode, epoll should return EPOLLIN as long as there's any data in the buffer that would be readable by a regular recvmsg() (I'm pretty sure it's not supposed to "remember" you're using MSG_PEEK or something like that, it's supposed to be stateless). My conclusion is that an event polling loop using MSG_PEEK like this must use EPOLLET. That's why I just added the flag. EPOLLET isn't stateful in terms of receive syscalls nor knows anything about MSG_PEEK or anything like that either, it simply guarantees that when data is added to the receive buffer there will be one subsequent notification triggered, which I think is what we want (as long as we always empty out the buffer before polling again). > By the way, it's also the first time I see this issue, but I suspect > that something like this is what's going on at: > > https://github.com/containers/podman/issues/23686#issuecomment-2563142828 > > ...and I just didn't find a way to reproduce it yet. I haven't tried other workloads, but the Steam downloading thing is very consistent. It's easy to catch in `strace` scrolling around a bit. I think since `strace` adds latency, it actually minimizes the problem a bit (giving the guest time to ACK data) and it's likely that when not being traced like that, passt is actually spending a huge number of syscalls in the busy-wait condition (which is why CPU usage balloons). Several people noticed this with passt in the past, this is actually just the first time I look into it. (Actually, I was looking to debug a different bug causing networking to get stuck entirely, but I haven't gotten there yet, I just ran into this one first.) > >> Add in the missing EPOLLET flag for this case. > > This is probably more robust in any case, and would have the obvious > advantage of fixing your issue right away. > > But I'm not entirely sure if we can do that safely, yet. I originally > disabled EPOLLET for TCP sockets in commit f2e3b9defdaa ("tcp: Drop > EPOLLET for non-spliced connections") based on the observation that > receiving functions don't guarantee we read everything from sockets > before they return. > > To use EPOLLET safely, we should always loop until we get EAGAIN, and > we don't do that. Now, the codebase changed significantly since then, > and nowadays we can probably do that without much effort, but I need > need to have a thorough look at it. > > Another thing to check is whether the epoll behaviour on SO_PEEK_OFF is > correct. Waking up a process just to reply EAGAIN doesn't make a lot of > sense. (See above, I think it's correct) > We could also consider setting STALLED in that case. Even if we come up > with a kernel fix, we'll need a workaround in passt for older kernels > anyway. I was thinking of this, maybe the easiest solution would be to just set STALLED if we get an EAGAIN *and* the current peek offset is nonzero. > > Give me a few days... unless you or somebody else can look into it, of > course. > >> This brings CPU >> usage down from around ~80% when downloading over TCP, to ~5% (use >> case: passt as network transport for muvm, downloading Steam games). > > Hah, maybe that's the key to reproducing this reliably. My usage of > passt with muvm at the moment is pretty much limited to SSH, DNS and > short "test" transfers. I'll give that a try (large HTTP transfers?). Yeah, I think it's large parallel transfers from a fast CDN (easily maxes out a gigabit internet connection). Perhaps an "optimized" downloader like aria2c would work similarly? `aria2c -x4 ` or something like that. ~~ Lina