From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=PfCLaAkw; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTPS id B2E845A004E for ; Fri, 10 Jan 2025 11:28:43 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1736504922; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=bVx+9USYQun5YQ6c3PoxXWfJnIgiocez4csQaxVBheU=; b=PfCLaAkwYb+bAxVqqwHRDPyKPbTMB6gJ1qyxpvS9EStdflWKin89ecDJuPdG0ZVB+LUhfe RF2ZNdLoE6ZaqwVgJTylsCYcgppiSQqUeXrHFnsrrEf+hD3xEGtDEiz36S9zXyUL91Ix5j tsZgtql9OnRN8wllZr99nsLAPBv7aqc= Received: from mail-wm1-f69.google.com (mail-wm1-f69.google.com [209.85.128.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-556-5sFbxcqUPDWvGFzlRuBoLg-1; Fri, 10 Jan 2025 05:28:41 -0500 X-MC-Unique: 5sFbxcqUPDWvGFzlRuBoLg-1 X-Mimecast-MFC-AGG-ID: 5sFbxcqUPDWvGFzlRuBoLg Received: by mail-wm1-f69.google.com with SMTP id 5b1f17b1804b1-4361ac607b6so15255665e9.0 for ; Fri, 10 Jan 2025 02:28:40 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736504920; x=1737109720; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=adGWu7iiwkRz+YgTaNlXPjMxcuENKg5R9k0ki7JA6z8=; b=F2cLTaOJmct8vlqBXMf9ixX4iZ5GnPuC1DjfPViqVWZbU0ow27GIPVpBDixLdBh+Te SZuh07+rGV0jAdGPF+fdJBGnx9NzpdhO0VDiIE157Kn1bE9z0VKsyjYwByrjdlrUm8o2 L0lJeuXTaj0GgfE6/s+B3XQdgEyb0HKSh51QsTk239N7fSJDgaUJYh9Bg79pmf0Sv/fU e2a5GPCvvHCXiWalI0Fjuz6chH9hMpxdG4iFGX5OY6MU4XOIoSMy3iOT9S2XwzRJsP9b 3U/aGs9iZnhp0gkyJsF35fsjsXmid4vpGqZddROp4Q77c1i9EdUbCOueG/gRaJd/ZdrL 5XeQ== X-Gm-Message-State: AOJu0YxjRoV7Pt7YwArxaSQ5MqpLHNHMKh6+CblYzQ5Ru+Oi1EU+hqtN C76hPWBsR4Jq7GptdNOrZfoXFeo2g5gfoTNMAyzgD37Pl15LaRxPzOJ+PC0CaNrnkKz0NCV8MoW Q0+M+rz88nGZX/G3Be8VzDvKMVKTRdRtAMJ9b3L+Z2YI2mMTlpg== X-Gm-Gg: ASbGncu4xpxJn3T6NLHE6GDWbxkUG+EYvIgv9sKhWtVeyNd3/xL/KbuXyLOGcMNumcn k+QXxvg7nUWgNy8l0/+OO16FITzTuD9TpECP2RhTAYb3khlCa1MGrO9gU4T4IBUQLr5huA6QD/o TLhe/4NrzwZOH4nHfKfwFXYKnsWPT5/k207ZPaM/ML2agUJjGhpi8vAYJg4DElO8/TWLCouGI/U naf/P0xPBh1GSWalwMZhbka/8RP3tqY5XmR9GwDGQ+QQpmFizYxMt7eQjBUweYhtTRu X-Received: by 2002:a05:600c:348c:b0:435:14d:f61a with SMTP id 5b1f17b1804b1-436e2707c4emr79430925e9.25.1736504919625; Fri, 10 Jan 2025 02:28:39 -0800 (PST) X-Google-Smtp-Source: AGHT+IEZHcYXLomj3Q1CbaU2zy17N+dGyHeENzfZKfwUAx2Y+RN7e9I05dlTsqdGdunkztoDRQW7/w== X-Received: by 2002:a05:600c:348c:b0:435:14d:f61a with SMTP id 5b1f17b1804b1-436e2707c4emr79430695e9.25.1736504919130; Fri, 10 Jan 2025 02:28:39 -0800 (PST) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-436e9d8fc67sm48305605e9.8.2025.01.10.02.28.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 10 Jan 2025 02:28:38 -0800 (PST) Date: Fri, 10 Jan 2025 11:28:36 +0100 From: Stefano Brivio To: Asahi Lina Subject: Re: [PATCH] tcp: Add missing EPOLLET flag for established sockets Message-ID: <20250110112836.50e8c63e@elisabeth> In-Reply-To: <20241229132530.33beab46@elisabeth> References: <20241228-tcp-epollet-fix-v1-1-0ff63e0dff63@asahilina.net> <20241228143044.2119a803@elisabeth> <20241229132530.33beab46@elisabeth> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: vUTYofKxju3l1xKE-KqA-kzXm7fmcPDNjQ7LwRmIGqE_1736504920 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Message-ID-Hash: WWNT6XH4LYGHAT253H7D7X2GOFMMYA3L X-Message-ID-Hash: WWNT6XH4LYGHAT253H7D7X2GOFMMYA3L X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Sergio Lopez , Jon Maloy X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: Sorry, it took a while (and I could reproduce this with muvm, too): On Sun, 29 Dec 2024 13:25:30 +0100 Stefano Brivio wrote: > On Sun, 29 Dec 2024 00:31:11 +0900 > Asahi Lina wrote: >=20 > > I was thinking of this, maybe the easiest solution would be to just set > > STALLED if we get an EAGAIN *and* the current peek offset is nonzero. = =20 >=20 > Actually, I think this is the correct solution (with "peek offset" here > meaning the actual offset at which we peek, not the value we pass to > SO_PEEK_OFF, see below), because it substantially restores the > behaviour that was intended before e63d281871ef ("tcp: leverage support > of SO_PEEK_OFF socket option when available"). >=20 > That's what STALLED was for: if there's data on the socket, but we > can't read anything more, set EPOLLET. An ACK from the guest will make > us clear EPOLLET and have another look. >=20 > Right now, STALLED is not always set when it should. If it's cleared > because of an ACK which doesn't acknowledge all the data pending on the > socket, then it's broken, because of the early return on EAGAIN. >=20 > Note that we generally set SO_PEEK_OFF to 0 (and reset it on > retransmissions, that is, whenever we need to "re-read" data), but that > does *not* mean that the offset is zero: it means that the offset is > "what we last peeked" (see also socket(7)). >=20 > I know, it sounds terribly confusing, but it makes SO_PEEK_OFF actually > useful (the overhead from system calls would otherwise make it not > worth using). >=20 > The actual peeking offset will be zero just after tcp_sock_consume(), > without an intervening EPOLLIN, and in a particular case of > tcp_revert_seq() (retransmission). >=20 > I'm not sure if we can ever have EAGAIN from the socket with a zero > peeking offset. If we can, then we need to track that. If we can't, > then what you suggest is equivalent to just setting STALLED if we get > EAGAIN. So, it looks like there are no paths where we would get EAGAIN with a zero peeking offset. Setting STALLED in both tcp_buf_data_from_sock() and tcp_vu_data_from_sock() before returning on EAGAIN actually implements what you suggested, solves the problem, and doesn't create new ones (at least in my tests). About setting EPOLLET unconditionally, we also discussed this with David Gibson in the latest development call: we could, in theory (and he even tried in the past), because there are two cases where we return before dequeuing all the data available from the socket (and we need a subsequent wake-up even if no further data arrives): - if we can't queue more data to the guest, in window. But then we can always rely on the guest waking us up with an ACK (or the ACK timeout handler, which we have) - if we fail to write to the hypervisor (or tap device, for containers). It shouldn't really happen under normal circumstances, but there are no guarantees. In that case, we *should* set EPOLLOUT on that descriptor, and keep track of all the sockets that had pending data, and poll them once the guest descriptor is ready for writing, otherwise they would get stuck (at least until a timeout occurs, but that might be bad enough to break functionality). But we don't have this implementation, yet. For the moment, I think we should set STALLED like you suggested (in those two functions, before the early return). Do you want to send a second version of the patch, or should I? I would actually be happy to add more names to the git log. :) > By the way, I wonder, if it's not too much effort to check: do you hit > this without SO_PEEK_OFF (return false in tcp_probe_peek_offset_cap())? >=20 > > > Give me a few days... unless you or somebody else can look into it, o= f > > > course. > > > =20 > > >> This brings CPU > > >> usage down from around ~80% when downloading over TCP, to ~5% (use > > >> case: passt as network transport for muvm, downloading Steam games).= =20 > > >=20 > > > Hah, maybe that's the key to reproducing this reliably. My usage of > > > passt with muvm at the moment is pretty much limited to SSH, DNS and > > > short "test" transfers. I'll give that a try (large HTTP transfers?).= =20 > >=20 > > Yeah, I think it's large parallel transfers from a fast CDN (easily > > maxes out a gigabit internet connection). Perhaps an "optimized" > > downloader like aria2c would work similarly? `aria2c -x4 ` or > > something like that. =20 >=20 > Perhaps, yes, thanks for the tip. >=20 > That's something I do quite frequently with QEMU guests (updating > distribution packages from a mirror very close, large parallel > transfers with iperf3, etc.), but maybe the virtio-net implementation > in libkrun is somehow peculiar in this regard. I could reproduce only with muvm and aria2c -x4 as you suggested, with a long transfer, *not* local but almost: same data centre and about ~300=C2=B5s RTT. More than that, and I don't see the issue, because the guest's window is always large enough otherwise (I guess). But yes, there's a peculiarity with muvm: the MTU is set to just 1500 bytes, whereas we typically have 65520 bytes with QEMU. With user-mode networking, given that packetisation is (re-)done on the host, anything less than what the guest interface supports doesn't make sense. I'm using dhcpcd, but https://github.com/AsahiLinux/muvm/pull/111 makes no difference (yet) because I forgot to handle the MTU there (passt advertises 65520). For some reason, dhcpcd is also not setting the MTU. I'll fix this there (or in a follow-up change). If I manually set 65520 bytes, by the way, I can't reproduce the issue anymore. --=20 Stefano