From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=iLdr33Rg; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTPS id C9DCD5A0274 for ; Thu, 16 Jan 2025 01:45:08 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1736988307; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/8SZS3hMO1ub/zMP8H1TwOyOiKHpF8QEErS+DV5gL4A=; b=iLdr33RgU5FNWhteIbXYxv3L1hOGujTpk2f6UetDC1SD4DB5QD4SSC6JRdFiJeKXWvdhVb gvFIOK/Oy0MFWS+b1TT8kk+yilHw5eiXsMCOlWNKFMXGrrc2pY3EC/jRCLlFwkVWZNBaS5 VPO7u7Gn3RAflnMc18aKCB8Ng4+x6Vg= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-578-BNG5JtHuN-mNAn4dVvGY9w-1; Wed, 15 Jan 2025 19:45:06 -0500 X-MC-Unique: BNG5JtHuN-mNAn4dVvGY9w-1 X-Mimecast-MFC-AGG-ID: BNG5JtHuN-mNAn4dVvGY9w Received: by mail-wr1-f69.google.com with SMTP id ffacd0b85a97d-38a2140a400so269479f8f.0 for ; Wed, 15 Jan 2025 16:45:06 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736988305; x=1737593105; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=/8SZS3hMO1ub/zMP8H1TwOyOiKHpF8QEErS+DV5gL4A=; b=A1A+KQfe5glvfZySpHPA1T47+OMuHV2+6zBDOYwtJ2fLC78eMC164rvOOFCdDV6soy tC9AMp7OOdGgAFYOxpW122q9ar7jWjX1UI4/pJcZcVOFnOzfX5NlxGpVW1aY/edfKyPT mLO3PMkvb4d9aJsL4tExsLQeETvUPIwURRtk7fXYHgAqICffEsXXMc5dQSclO9J9WO7I EV8dzAm4rMb/Mf83w0HqU0qNBmNFoIEElJLjEnkOzz1ff/KmwR5COCjEPZt3pqEnV7El 1y5Wb4i4vurYtKYqfSJWsaGlxn7IrYPonxTDpRczDlkrg878lDW7/EtXjeeCTS3/9NGt ExvA== X-Gm-Message-State: AOJu0Yypndqo0RevspytxV1QABDL0taYy8QGujOxG4WYBRZlpFMZm+dH rFMfmQD7n98Fz9dj62ccbpruLzVUJ3qnY7TvbtjBrwxaTbKmEOsSPo+DrNSvTIZR/Dw97cJlomY +1AeLp3LGt6I628GkOpdEHk4RB9HPDIEvN96QV8M4n03rMganKQ== X-Gm-Gg: ASbGncsU3j8Y4MKKK4N1X6afDuD158tgymfqv3W5CJwZhVRLS3Q/KKgsQ0cxoWAB6j6 2rgWVHhClfR1vyudCngV8D247A/XObhirh0EiqVnAbVvJ+QPr6La+RrYyNcI4Oj/oyS0ufEtdne rA6WsrugYIeLD8PbKIBZlFDiZMoGEnysQr6PvKanWPyG4ACZG2vibK0kjeXbWHRcJ7CLPrEtcTQ ufCjgzucQZ7V+DZU0s8vzwvyE9r18n8nY9J9j4ORGvTADmnvkwIRyStvZ0R0ycHjf+TxbbZDxWu 3nFunSqPtw== X-Received: by 2002:a05:6000:4b0a:b0:385:fae2:f443 with SMTP id ffacd0b85a97d-38a87313975mr27737451f8f.34.1736988304922; Wed, 15 Jan 2025 16:45:04 -0800 (PST) X-Google-Smtp-Source: AGHT+IGQ6zLdtaBS9AQUvvDUHDB2SGM4wr7Ey0bS88Z6bNhe1zRpfIBMfKzEWebiPtzyQNC62IMLhQ== X-Received: by 2002:a05:6000:4b0a:b0:385:fae2:f443 with SMTP id ffacd0b85a97d-38a87313975mr27737434f8f.34.1736988304531; Wed, 15 Jan 2025 16:45:04 -0800 (PST) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [176.103.220.4]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-437c74c6799sm40655085e9.24.2025.01.15.16.45.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Jan 2025 16:45:02 -0800 (PST) Date: Thu, 16 Jan 2025 01:45:00 +0100 From: Stefano Brivio To: passt-dev@passt.top Subject: Re: [PATCH] tcp: Set EPOLLET when when reading from a socket fails with EAGAIN Message-ID: <20250116014500.7da58696@elisabeth> In-Reply-To: <20250114223316.3995265-1-sbrivio@redhat.com> References: <20250114223316.3995265-1-sbrivio@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: Je2ExOng-F83jV3sautax2AAERwxtWaPm4TbMPRzSlo_1736988305 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: YKCR4ECDPZNMQPUKYGIQNMZX3XSD53TA X-Message-ID-Hash: YKCR4ECDPZNMQPUKYGIQNMZX3XSD53TA X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Asahi Lina , Sergio Lopez , Jon Maloy , Laurent Vivier X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Tue, 14 Jan 2025 23:33:16 +0100 Stefano Brivio wrote: > Before SO_PEEK_OFF support was introduced by commit e63d281871ef > ("tcp: leverage support of SO_PEEK_OFF socket option when available"), > we would peek data from sockets using a "discard" buffer as first > iovec element, so that, unless we had no pending data at all, we would > always get a positive return code from recvmsg() (except for closing > connections or errors). > > If we couldn't send more data to the guest, in the window, we would > set the STALLED flag (causing the epoll descriptor to switch to > edge-triggered mode), and return early from tcp_data_from_sock(). > > With SO_PEEK_OFF, we don't have a discard buffer, and if there's data > on the socket, but nothing beyond our current peeking offset, we'll > get EAGAIN instead of our current "discard" length. In that case, we > return even earlier, and we don't set EPOLLET on the socket as a > result. > > As reported by Asahi Lina, this causes event loops where the kernel is > signalling socket readiness, because there's data we didn't dequeue > yet (waiting for the guest to acknowledge it), but we won't actually > peek anything new, and return early without setting EPOLLET. > > This is the original report, mentioning the originally proposed fix: > > -- > When there is unacknowledged data in the inbound socket buffer, passt > leaves the socket in the epoll instance to accept new data from the > server. Since there is already data in the socket buffer, an epoll > without EPOLLET will repeatedly fire while no data is processed, > busy-looping the CPU: > > epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 > recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 > recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > > Add in the missing EPOLLET flag for this case. This brings CPU > usage down from around ~80% when downloading over TCP, to ~5% (use > case: passt as network transport for muvm, downloading Steam games). > -- > > we can't set EPOLLET unconditionally though, at least right now, > because we don't monitor the guest tap for EPOLLOUT in case we fail > to write on that side because we filled up that buffer (and not the > window of a TCP connection). > > Instead, rely on the observation that we only get EAGAIN on recvmsg() > if we are attempting to peek data from a socket with a non-zero > peeking offset: we only peek when there's pending data on a socket, > and in that case, if we peek without offset, we'll always see some > data. > > And if we peek data with a non-zero offset and get EAGAIN, that means > that we're either waiting for more data to arrive on the socket (which > would cause further wake-ups, even with EPOLLET), or we're waiting for > the guest to acknowledge some of it, which would anyway cause a > wake-up. > > In that case, it's safe to set STALLED and, in turn, EPOLLET on the > socket, which fixes the EPOLLIN event loop. > > Reported-by: Asahi Lina > Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available") > Signed-off-by: Stefano Brivio > --- > tcp_buf.c | 1 + > tcp_vu.c | 2 ++ > 2 files changed, 3 insertions(+) > > diff --git a/tcp_buf.c b/tcp_buf.c > index a975a55..577467b 100644 > --- a/tcp_buf.c > +++ b/tcp_buf.c > @@ -359,6 +359,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > return -errno; > } > > + conn_flag(c, conn, STALLED); > return 0; > } > > diff --git a/tcp_vu.c b/tcp_vu.c > index 10e17d3..22ae71d 100644 > --- a/tcp_vu.c > +++ b/tcp_vu.c > @@ -399,6 +399,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > tcp_rst(c, conn); > return len; > } > + > + conn_flag(c, conn, STALLED); This occasionally makes the tests hang at: passt_vu_in_ns/tcp [19/32] - TCP/IPv6: guest to host: big transfer for the vhost-user case only. Yes, guest to host. I didn't really check why and how, I thought the vhost-user path here would pretty much match the other one, but that doesn't seem to be the case. Perhaps we don't always wake up if the guest acknowledges something. > return 0; > } > -- Stefano