From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=Vbof4OGP; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTPS id ACDC25A0271 for ; Sat, 28 Dec 2024 14:30:53 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1735392652; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=CGufOlRWxw6wuI78/8FM4an+BtxH7bhx2GZi/KbRmno=; b=Vbof4OGPyqGKddEoGadVIRFxnZSYnDwEE3F0XzNr+VKX1aCEEpVcjoN5j4h8MIGw/rgikd iSrW16A16jtNZoUwO6Wiu8nEX94vDN9/sF4AvP9F2NMDM4rmLxMID8QaS1uK5NWisqpcHc AyGKER8U0Qf4TVKo2tTuJtqqCrgw67A= Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com [209.85.221.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-550-iRgqMUndMaOUIDLW0XcNew-1; Sat, 28 Dec 2024 08:30:50 -0500 X-MC-Unique: iRgqMUndMaOUIDLW0XcNew-1 X-Mimecast-MFC-AGG-ID: iRgqMUndMaOUIDLW0XcNew Received: by mail-wr1-f71.google.com with SMTP id ffacd0b85a97d-385d735965bso4598412f8f.1 for ; Sat, 28 Dec 2024 05:30:50 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1735392649; x=1735997449; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=CGufOlRWxw6wuI78/8FM4an+BtxH7bhx2GZi/KbRmno=; b=QorFOaSJZQGApsuAXQnRfWoW9rPA/WgCCiqdu2AM5x5TBNN5AfPoMrlpobyjYNCNcQ LDbcHIhZvAqGRYTduRbifZVKM3To1x5WC1rPhrattgLcUi+VTOKlR0XVkdI/HtTmyO7d 7mIc+7oMwKZ3xBR3D74xV1/1MOlCQJG/obXVFpBSnBW5jQfpnol6avJNUBwGA3By3x+w F08qzhBf7G60qC2I6KoSQwTrwSPUFXMw2boUySJx35XiRGynsu/IAeWff5NQEGT0IbxY ZKGorRWCzniqbCYd0S4kaGQp/LwQx3aP+UFjo1UYHjGZeNyGylK3qH663r9m4mR2cP05 e/hg== X-Gm-Message-State: AOJu0YzYpe4egJhns7wyDIwHz4HJKWlSVooKcVUDML9iUTW0Ow3hFShZ qoUPnadVwaseK/vHBmokOO9gLxHgtCwrNfRl13m/5ZxdCRD6tx7xvYPTldfVsrCESPpRVuVGl/j tC9RYK1LvGvy6RlzXeSeH7ne1riQEeipcc8qyHkmsHpc/HHelpA== X-Gm-Gg: ASbGncvXSxI1ITM0ntK14yTHFXNm3zHKplCy1lFuzbdqqMQum7K++iLsG6UZD6ksWg9 MqVli359TTNnKozdXwB+XQdBG5XW2fGh7IsoXgZ3OGCQrLL3Tx6MHqgQEUXt5jq8Hvdh+6WtKMW ksQyCJbgJQMeWZYnTlYU/pvvtwV2vmAWsSOayP8arxX4Zow0v32ek+WKAuY4yOUNfXJO0T5u9B4 /RjT3S6aKCq5Se1BIifHdDrRYGEHux9haWgVOyF7VAnPJj3x6dGREwq/UyNfJyavshp X-Received: by 2002:a5d:47ab:0:b0:385:e16d:51c0 with SMTP id ffacd0b85a97d-38a221eaca5mr26488631f8f.18.1735392648703; Sat, 28 Dec 2024 05:30:48 -0800 (PST) X-Google-Smtp-Source: AGHT+IHZtLomX0rJFyu4iQq0zJy73xBiXTesU/M91Iod4oxrpJwm1JD5mxZs+As7ciiWwxd/Hd4EsQ== X-Received: by 2002:a5d:47ab:0:b0:385:e16d:51c0 with SMTP id ffacd0b85a97d-38a221eaca5mr26488611f8f.18.1735392648286; Sat, 28 Dec 2024 05:30:48 -0800 (PST) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-38a1c89e126sm24513236f8f.65.2024.12.28.05.30.46 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 28 Dec 2024 05:30:47 -0800 (PST) Date: Sat, 28 Dec 2024 14:30:44 +0100 From: Stefano Brivio To: Asahi Lina Subject: Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets Message-ID: <20241228143044.2119a803@elisabeth> In-Reply-To: <20241228-tcp-epollet-fix-v1-1-0ff63e0dff63@asahilina.net> References: <20241228-tcp-epollet-fix-v1-1-0ff63e0dff63@asahilina.net> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: zc8BWTVeBdUhRkzhmYYgc9S_NPnsc9W4_ijCp8T77zs_1735392649 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: YTTP7USQ7OLONAIQKNKTQK4QGOYOCVDZ X-Message-ID-Hash: YTTP7USQ7OLONAIQKNKTQK4QGOYOCVDZ X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Sergio Lopez , Jon Maloy X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: Hi, thanks for the report and for the patch! A couple of comments: On Sat, 28 Dec 2024 20:43:15 +0900 Asahi Lina wrote: > When there is unacknowledged data in the inbound socket buffer, passt > leaves the socket in the epoll instance to accept new data from the > server. Since there is already data in the socket buffer, an epoll > without EPOLLET will repeatedly fire while no data is processed, > busy-looping the CPU: > > epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 > recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 > recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) > recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) Oops. This is unexpected, in the sense that if we have data on the socket (we leave it there until it's acknowledged by the guest), and we read all of it, the kernel wakes us up, and tcp_buf_data_from_sock() should set the (internal) STALLED flag on the socket, which implies EPOLLET: if (len <= 0) { conn_flag(c, conn, STALLED); return 0; } But here, we get -EAGAIN, so we don't touch the socket flags at all, because we don't expect that the kernel will wake us up again: if (len < 0) { if (errno != EAGAIN && errno != EWOULDBLOCK) { tcp_rst(c, conn); return -errno; } return 0; } ...and instead it does. I think it's an unintended consequence of using SO_PEEK_OFF for TCP. Jon (Cc'ed) added that a while ago, with kernel commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"), and matching passt commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available"). Long story short: as we leave unacknowledged data on the socket, we need a way to fetch new data only, and send it to the guest if we have space in the window. Before SO_PEEK_OFF for TCP, we used a "discard" buffer for recvmsg() to skip unacknowledged but "old" data. With SO_PEEK_OFF, we can directly tell the kernel to skip a given amount of bytes (the OFFset), and peek "after" that. But that changes the meaning of EAGAIN, from our perspective. Should the kernel really wake us up if there's no data after that offset? By the way, it's also the first time I see this issue, but I suspect that something like this is what's going on at: https://github.com/containers/podman/issues/23686#issuecomment-2563142828 ...and I just didn't find a way to reproduce it yet. > Add in the missing EPOLLET flag for this case. This is probably more robust in any case, and would have the obvious advantage of fixing your issue right away. But I'm not entirely sure if we can do that safely, yet. I originally disabled EPOLLET for TCP sockets in commit f2e3b9defdaa ("tcp: Drop EPOLLET for non-spliced connections") based on the observation that receiving functions don't guarantee we read everything from sockets before they return. To use EPOLLET safely, we should always loop until we get EAGAIN, and we don't do that. Now, the codebase changed significantly since then, and nowadays we can probably do that without much effort, but I need need to have a thorough look at it. Another thing to check is whether the epoll behaviour on SO_PEEK_OFF is correct. Waking up a process just to reply EAGAIN doesn't make a lot of sense. We could also consider setting STALLED in that case. Even if we come up with a kernel fix, we'll need a workaround in passt for older kernels anyway. Give me a few days... unless you or somebody else can look into it, of course. > This brings CPU > usage down from around ~80% when downloading over TCP, to ~5% (use > case: passt as network transport for muvm, downloading Steam games). Hah, maybe that's the key to reproducing this reliably. My usage of passt with muvm at the moment is pretty much limited to SSH, DNS and short "test" transfers. I'll give that a try (large HTTP transfers?). -- Stefano