From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top, Paul Holzinger
Subject: Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
Date: Thu, 21 May 2026 07:40:31 +0200 (CEST)
Message-ID: <20260521074030.0e15b36e@elisabeth>
References: <20260520130851.436931-1-david@gibson.dropbear.id.au> <20260520130851.436931-6-david@gibson.dropbear.id.au> <20260520223003.37ceb0f8@elisabeth>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Thu, 21 May 2026 12:03:33 +1000
David Gibson wrote:

> On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > On Wed, 20 May 2026 23:08:50 +1000
> > David Gibson wrote:
> > 
> > > There are two ways we can tell one of our sockets has received a FIN. We
> > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > (EOF) on the socket.
> > > We currently use both, in a mildly confusing way:
> > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > some other close out logic is based on seeing an EOF.
> > > 
> > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > 
> > > Signed-off-by: David Gibson
> > > ---
> > >  tcp_splice.c | 14 +++++---------
> > >  1 file changed, 5 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > index 8fbd490f..b45f0060 100644
> > > --- a/tcp_splice.c
> > > +++ b/tcp_splice.c
> > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > >  	uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > >  	uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > >  	int never_read = 1;
> > > -	int eof = 0;
> > >  
> > >  	while (1) {
> > >  		ssize_t readlen, written;
> > > @@ -510,7 +509,7 @@ retry:
> > >  		flow_trace(conn, "%zi from read-side call", readlen);
> > >  
> > >  		if (!readlen) {
> > > -			eof = 1;
> > > +			conn_event(conn, FIN_RCVD(fromsidei));
> > 
> > I'm not sure if I really found a concrete issue with this, but it looks
> > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > mean that we infer we received a FIN, regardless of whether we're done
> > processing all data from that half of the connection.
> > 
> > Now FIN_RCVD is only set if we actually processed all the data and we
> > hit the end of file.
> 
> True. But the only place that tested FIN_RCVD was at the end of
> tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
> was the cause of bug202 - we had FIN_RCVD set, but we didn't process
> it and shutdown() on the other side, because we didn't have eof.
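
For reference, a minimal sketch of the two FIN indications mentioned at
the top of the commit message, outside of passt. This is not code from
tcp_splice.c: fin_probe() is a hypothetical helper, it uses an AF_UNIX
socketpair rather than TCP sockets, and it uses read() rather than
splice(), so it says nothing about what splice() guarantees at end of
input.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo helper, not passt code: the peer half-closes a
 * socketpair and we report both indications of that on our end.
 * Returns 0 on success, -1 if any setup call fails.
 */
int fin_probe(int *saw_rdhup, ssize_t *readlen)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP };
	struct epoll_event got;
	int sv[2], epfd, ret = -1;
	char buf[16];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

	epfd = epoll_create1(0);
	if (epfd < 0)
		goto out_sock;

	ev.data.fd = sv[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev))
		goto out_epoll;

	/* Peer half-closes: the stream equivalent of sending a FIN */
	if (shutdown(sv[1], SHUT_WR))
		goto out_epoll;

	/* Wait up to one second for an event on our end */
	if (epoll_wait(epfd, &got, 1, 1000) != 1)
		goto out_epoll;

	/* Indication 1: EPOLLRDHUP reported by epoll */
	*saw_rdhup = !!(got.events & EPOLLRDHUP);

	/* Indication 2: zero-length read, that is, EOF */
	*readlen = read(sv[0], buf, sizeof(buf));
	ret = 0;

out_epoll:
	close(epfd);
out_sock:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

With no data queued, I'd expect a caller on Linux to see both
*saw_rdhup set and *readlen equal to 0 here. The case discussed in this
thread is the harder one, where the two indications can be observed at
different times.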

That sounds like a good motivation to clean this up, just two concerns
below:

> > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > and it would have returned 0 instead.
> > 
> > Will we ever get our zero-sized "read" later? If not, we might have
> > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > guarantees in that sense from splice().
> 
> It's not really about guarantees from splice. I'm pretty sure this is
> ok, reasoning as follows.
> 
> Consider all the exit points from the loop body:
>   - Each return is a return -1, so we kill the connection anyway. They
>     don't matter
>   - Each continue, goto retry and the end of the body will do the read
>     side splice() again, so get another chance to see the EOF
>   - That leaves just the breaks
> 
> Consider each break (there are three, since patch 2 of this series)
> 		if (written < 0) {
> 			if (!conn->pending[fromsidei])
> 				break;
> 
> (1) The pipe is empty and the write-splice returned EAGAIN, so it
> didn't remove data from the pipe.

You're assuming that !conn->pending[fromsidei] means that the pipe is
empty. From what we see of it, it is.

What the kernel can do with it, though, is different. It might return
EAGAIN even if we think we should have space, because it's resizing it
under memory pressure or anything like that. Or it delays freeing up
space or accounting for whatever reason.

So it would be nice to make this part robust to that. I thought setting
FIN_RCVD on EPOLLRDHUP was a good way to achieve that.

> Therefore, the pipe must have been
> empty before the write-splice. Which means the read-splice can't have
> blocked on a full pipe.
> 			conn_event(conn, OUT_WAIT(!fromsidei));
> 			break;
> 		}
> 
> (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> must have blocked on the output socket.
> We've set OUT_WAIT(), so
> we'll get an EPOLLOUT at some point which will cause us to read-splice
> again, meaning we get another chance to see the EOF.

...later. But what if we don't get a zero-sized read *at all*? I'm not
sure if splice() guarantees we do get one if we reach end-of-file.

That's something valid and very well established for read() and recv(),
but splice() is a bit weird. The documentation says:

  A return value of 0 means end of input.

but I wouldn't assume we'll *always* get at least one in case of EOF.

> 
> [...]
> 		if (conn->events & FIN_RCVD(fromsidei))
> 			break;
> (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> 
> > The existing implementation distinguishes between end-of-file we hit in
> > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > That was actually intended.
> 
> It might be intended, but I can't see that we did anything with that
> information. We always set FIN_RCVD on it.

You're right, if we only checked that on 'eof', that didn't solve much,
but that wasn't necessarily intended. My original intention was to make
setting of FIN_RCVD (or whatever it was originally) robust.

> That said the conditions on which we exit / retry this loop are pretty
> darn confusing. I'll see if I can improve them.

-- 
Stefano