From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=S+Zcdd6F; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTPS id A36225A0265 for ; Thu, 21 May 2026 09:15:22 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1779347721; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xhWASuZ3igodLyvq5b+s2cT8yRBBUVv2vhLdBd+b1KE=; b=S+Zcdd6F0E1KKhEj+YP43DI6Qancug8eLRhbGUWy8DY0hg6fAPMGM05l/M4r/C+5nDR6Jp fBmnpPnIds+OIRBu2PWzv9tQssvKPcozUpgSfSbnovGcgGyzVPx3RY0KWtE7PYO1vDbVh0 dOorM62KgC9YOBNQ3PNxmAZ2gDNa+us= Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com [209.85.221.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-564-U3BNWhaZMNmEt5inyUawjQ-1; Thu, 21 May 2026 03:15:18 -0400 X-MC-Unique: U3BNWhaZMNmEt5inyUawjQ-1 X-Mimecast-MFC-AGG-ID: U3BNWhaZMNmEt5inyUawjQ_1779347717 Received: by mail-wr1-f71.google.com with SMTP id ffacd0b85a97d-43fe791a398so4588266f8f.0 for ; Thu, 21 May 2026 00:15:18 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779347717; x=1779952517; h=date:content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:x-gm-gg :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=xhWASuZ3igodLyvq5b+s2cT8yRBBUVv2vhLdBd+b1KE=; b=hjO5koTtAts5jBYY6y1jYQNnqJ6Ea7bNm1PSqUazQBQh7Irrvf5xcXCbrzcVgFGDkV sQWS4CvC42iVM4LKq1J4nvyX1gmdEnfB2ZJyUq3Hhtc9nn9LGwXNKDPEqkwj6732uNok hHv5d2rH81eMkPZxNCg5dbGaFXFnNenmoRK2LFk7XRfsJI18VfB39kYdxa2+2K0KEo+Y hM2oYLzCYkOwygg9PwRme9dV4CnwU/aBeC97fCvKwEXFxf/M12Dwq6/p3z4Ju1D6aawo 41RbyNSH+zByj3MYv2Wj16qQ/isIKq+hOoQt1QWbMNcroB/gWVXHQomR1p0kvFIXqwgv 5k7Q== X-Gm-Message-State: AOJu0YzIs/jpnyQGSB449gimUaLfba560789htKpHHy1euIk90XYy0T1 tSoZ9BTN2hSTJu4c5uLG9STTwH/yC/m4bAxcykMPXxjJ9FWYpGSQsKBL+dItQodSaCGGb8nC4xg pvmq4aLi+pT7v+C6/rIjavXyl/CTRXSYi3qKEJbpw92aMjSz0+goO1g== X-Gm-Gg: Acq92OE8zEWHIo4bNihlbb+mNxj3xMmRm1+3sFeCPrtziAPS07VHjiMcr6iNMCfBjbR SVP0Mq00Lw/0fOLm/IKav7CPHhTV3GTWzDKAe+9hUfarIK4hHYacnLIGZ5fxJZcH8My7/sR3jNE NxMfHhgmHLf7iUxmVJsxY0DJEWHAQb1eb4dCIJcVjNjTceTcM5wjUUjDevwj7yNn6WgCtJfbwvd u4EaAJUMeoOAJOaqp9Zn59cr3J0nESV3G0BbOu/9+tQqZBhGegkGjO04uWkGiwV18O0uuiV9e6s h8NmyC+dLpkld8ZxBAfK+yhBfK7qVw+nNg3eV5or+chaAUK0p4lbtjg2HLEATbAWEUAj9WHgb+j QffEdev09Ve3LlsHX3Q4vVR9pADC+1TUgglNqka+TxgQTPgiZDA== X-Received: by 2002:a05:600d:2:b0:48f:e1ac:c94f with SMTP id 5b1f17b1804b1-4903605f04emr17110235e9.10.1779347716257; Thu, 21 May 2026 00:15:16 -0700 (PDT) X-Received: by 2002:a05:600d:2:b0:48f:e1ac:c94f with SMTP id 5b1f17b1804b1-4903605f04emr17108035e9.10.1779347714956; Thu, 21 May 2026 00:15:14 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [176.103.220.4]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-45eaa7cf58fsm897990f8f.3.2026.05.21.00.15.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 21 May 2026 00:15:14 -0700 (PDT) From: Stefano Brivio To: David Gibson Subject: Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling Message-ID: <20260521091512.1ede0a84@elisabeth> In-Reply-To: References: <20260520130851.436931-1-david@gibson.dropbear.id.au> <20260520130851.436931-6-david@gibson.dropbear.id.au> <20260520223003.37ceb0f8@elisabeth> <20260521074030.0e15b36e@elisabeth> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu) MIME-Version: 1.0 Date: Thu, 21 May 2026 09:15:13 +0200 (CEST) X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: U8DW73o0NPSIhohSl6SH_2cDrQfrVI4KBVJBtbV9xHo_1779347717 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: LQGQPLBLOWGMQHLLZPEWF7I7N36WRYAM X-Message-ID-Hash: LQGQPLBLOWGMQHLLZPEWF7I7N36WRYAM X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Thu, 21 May 2026 16:56:43 +1000 David Gibson wrote: > On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote: > > On Thu, 21 May 2026 12:03:33 +1000 > > David Gibson wrote: > > > > > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote: > > > > On Wed, 20 May 2026 23:08:50 +1000 > > > > David Gibson wrote: > > > > > > > > > There are two ways we can tell one of our sockets has received a FIN. We > > > > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read > > > > > (EOF) on the socket. We currently use both, in a mildly confusing way: > > > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then > > > > > some other close out logic is based on seeing an EOF. > > > > > > > > > > Simplify this by setting the flag based on only the EOF. To make sure we > > > > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the > > > > > forwarding path for EPOLLRDHUP as well as EPOLLIN. > > > > > > > > > > Signed-off-by: David Gibson > > > > > --- > > > > > tcp_splice.c | 14 +++++--------- > > > > > 1 file changed, 5 insertions(+), 9 deletions(-) > > > > > > > > > > diff --git a/tcp_splice.c b/tcp_splice.c > > > > > index 8fbd490f..b45f0060 100644 > > > > > --- a/tcp_splice.c > > > > > +++ b/tcp_splice.c > > > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct > > > > > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei); > > > > > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei); > > > > > int never_read = 1; > > > > > - int eof = 0; > > > > > > > > > > while (1) { > > > > > ssize_t readlen, written; > > > > > @@ -510,7 +509,7 @@ retry: > > > > > flow_trace(conn, "%zi from read-side call", readlen); > > > > > > > > > > if (!readlen) { > > > > > - eof = 1; > > > > > + conn_event(conn, FIN_RCVD(fromsidei)); > > > > > > > > I'm not sure if I really found a concrete issue with this, but it looks > > > > a bit scary, because it changes the semantics of FIN_RCVD, which used to > > > > mean that we infer we received a FIN, regardless of whether we're done > > > > processing all data from that half of the connection. > > > > > > > > Now FIN_RCVD is only set if we actually processed all the data and we > > > > hit the end of file. > > > > > > True. But the only place that tested FIN_RCVD was at the end of > > > tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this > > > was the cause of bug202 - we had FIN_RCVD set, but we didn't process > > > it and shutdown() on the other side, because we didn't have eof. > > > > That sounds like a good motivation to clean this up, just two concerns > > below: > > > > > > The (potential) issue I see here is that we get EPOLLRDHUP, splice() > > > > returns -1 with EAGAIN in errno because we had no room in the pipe, > > > > and it would have returned 0 instead. > > > > > > > > Will we ever get our zero-sized "read" later? If not, we might have > > > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have > > > > guarantees in that sense from splice(). > > > > > > It's not really about guarantees from splice. I'm pretty sure this is > > > ok, reasoning as follows. > > > > > > Consider all the exit points from the loop body: > > > - Each return is a return -1, so we kill the connection anyway. They > > > don't matter > > > - Each continue, goto retry and the end of the body will do the read > > > side splice() again, so get another chance to see the EOF > > > - That leaves just the breaks > > > > > > Consider each break (there are three, since patch 2 of this series) > > > if (written < 0) { > > > if (!conn->pending[fromsidei]) > > > break; > > > > > > (1) The pipe is empty and the write-splice returned EAGAIN, so it > > > didn't remove data from the pipe. > > > > You're assuming that !conn->pending[fromsidei] means that the pipe is > > empty. From what we see of it, it is. > > It does mean the pipe is empty. Everything we put in, we've taken > out. There cannot be anything in there. > > > What the kernel can do with it, though, is different. It might return > > EAGAIN even if we think we should have space, because it's resizing it > > under memory pressure or anything like that. Or it delays freeing up > > space or accounting for whatever reason. > > Theoretically, I suppose. But !pending doesn't just mean the pipe is > not full it means it's completely completely empty. Not being able to > put any bytes at all into an empty pipe would be *very* surprising. > So much so that if it happened in practice, I suspect we wouldn't be > safe not having epoll events on the pipe ends, so that we can be > notified when it deigns to accept some data. We can get 512-byte pipes, I actually saw that happening in practice with either: - people setting low values for ulimits - the user (or just pasta itself) having a lot of pipes open and if I recall correctly that's where I saw the case of a supposedly empty pipe giving us EAGAIN. That was years ago though and I didn't specifically fix that. We currently probe the size based on the value we can have for 32 pipes (TCP_SPLICE_PIPE_POOL_SIZE). By making that 4096 or so you should get rather small pipes. Things might already be broken with them, I haven't checked the behaviour in a long while. I think 512 bytes was the lower bound I hit. > > So it would be nice to make this part robust to that. I thought setting > > FIN_RCVD on EPOLLRDHUP was a good way to achieve that. > > > > > Therefore, the pipe must have been > > > empty before the write-splice. Which means the read-splice can't have > > > blocked on a full pipe. > > > conn_event(conn, OUT_WAIT(!fromsidei)); > > > break; > > > } > > > > > > (2) The pipe is non-empty and the write-splice returned EAGAIN, so it > > > must have blocked on the output socket. We've set OUT_WAIT(), so > > > we'll get an EPOLLOUT at some point which will cause us to read-splice > > > again, meaning we get another chance to see the EOF. > > > > ...later. But what if we don't get a zero-sized read *at all*? I'm not > > sure if splice() guarantees we do get one if we reach end-of-file. > > > That's something valid and very well established for read() and recv(), > > but splice() is a bit weird. The documentation says: > > > > A return value of 0 means end of input. > > > > but I wouldn't assume we'll *always* get at least one in case of EOF. > > What else could we plausibly get? -1 with EBADF, probably with EPOLLERR, because something timed out? But I guess you're right, as long as we're not in the EPOLLERR category of things, we should consistently get 0, even if we read multiple times. I had in mind some kernel behaviour where we get 0 once, and then -1 (EAGAIN?) because... go figure. But no, it can't happen. > > > [...] > > > if (conn->events & FIN_RCVD(fromsidei)) > > > break; > > > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF. > > > > > > > The existing implementation distinguishes between end-of-file we hit in > > > > a given iteration, and EPOLLRDHUP we might have seen at any time. > > > > That was actually intended. > > > > > > It might be intended, but I can't see that we did anything with that > > > information. > > > > We always set FIN_RCVD on it. You're right, if we only checked that on > > 'eof', that didn't solve much, but that wasn't necessarily intended. My > > original intention was to make setting of FIN_RCVD (or whatever it was > > originally) robust. > > Ok, well. I've spotted other changes to make in the vicinity that I > think will make some of this easier to reason about anyway. So I'll > consider your points as I rework this and other patches. > > > > That said the conditions on which we exit / retry this loop are pretty > > > darn confusing. I'll see if I can improve them. > > > > -- > > Stefano -- Stefano