* [PATCH 0/6] Fix race condition while closing spliced connections
@ 2026-05-20 13:08 David Gibson
2026-05-20 13:08 ` [PATCH 1/6] tcp_splice: Improve error reporting David Gibson
` (5 more replies)
0 siblings, 6 replies; 27+ messages in thread
From: David Gibson @ 2026-05-20 13:08 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: Paul Holzinger, David Gibson
Fix bug 202, where a race condition could cause connections to be
incorrectly reset in certain circumstances.
Patch 2/6 is the bug fix proper. 1/6 improves error reporting and
debugging messages in the vicinity. Patches 3..6/6 are some cleanups
I noticed in the area while working on the fix.
Link: https://bugs.passt.top/show_bug.cgi?id=202
David Gibson (6):
tcp_splice: Improve error reporting
tcp_splice: Avoid missing EOF recognition while forwarding
tcp_splice: Clean up flow control path for splice forwarding
tcp_splice: Simplify tracking of read/written bytes
tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
tcp_splice: Simplify shutdown(2) handling
tcp_conn.h | 6 +-
tcp_splice.c | 180 +++++++++++++++++++++++++++------------------------
2 files changed, 97 insertions(+), 89 deletions(-)
--
2.54.0
* [PATCH 1/6] tcp_splice: Improve error reporting
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
@ 2026-05-20 13:08 ` David Gibson
2026-05-20 14:31 ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding David Gibson
` (4 subsequent siblings)
5 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-20 13:08 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: Paul Holzinger, David Gibson
A number of things can, at least theoretically, go wrong when forwarding
data across a spliced connection. We generally handle this by resetting
the connection on both sides. However, in many cases we don't log any
message about why the connection was reset, which can make it hard to
debug why this is happening.
Add a bunch of debug and error logging to make this easier to figure out.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 42ee8abc..1359d6b8 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -502,15 +502,18 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
if (rc)
flow_perror(conn, "Error retrieving SO_ERROR");
else
- flow_trace(conn, "Error event on socket: %s",
- strerror_(err));
-
+ flow_dbg(conn, "Error event on %s socket: %s",
+ pif_name(conn->f.pif[evsidei]),
+ strerror_(err));
goto reset;
}
if (conn->events == SPLICE_CONNECT) {
- if (!(events & EPOLLOUT))
+ if (!(events & EPOLLOUT)) {
+ flow_err(conn, "Unexpected events 0x%x during connect",
+ events);
goto reset;
+ }
if (tcp_splice_connect_finish(c, conn))
goto reset;
}
@@ -545,8 +548,11 @@ retry:
SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
while (readlen < 0 && errno == EINTR);
- if (readlen < 0 && errno != EAGAIN)
+ if (readlen < 0 && errno != EAGAIN) {
+ flow_perror(conn, "Splicing from %s socket",
+ pif_name(conn->f.pif[fromsidei]));
goto reset;
+ }
flow_trace(conn, "%zi from read-side call", readlen);
@@ -569,8 +575,11 @@ retry:
SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
while (written < 0 && errno == EINTR);
- if (written < 0 && errno != EAGAIN)
+ if (written < 0 && errno != EAGAIN) {
+ flow_perror(conn, "Splicing to %s socket",
+ pif_name(conn->f.pif[!fromsidei]));
goto reset;
+ }
flow_trace(conn, "%zi from write-side call (passed %zi)",
written, c->tcp.pipe_size);
@@ -627,8 +636,11 @@ retry:
flow_foreach_sidei(sidei) {
if ((conn->events & FIN_RCVD(sidei)) &&
!(conn->events & FIN_SENT(!sidei))) {
- if (shutdown(conn->s[!sidei], SHUT_WR) < 0)
+ if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
+ flow_perror(conn, "shutdown() on %s",
+ pif_name(conn->f.pif[!sidei]));
goto reset;
+ }
conn_event(conn, FIN_SENT(!sidei));
}
}
@@ -647,8 +659,11 @@ retry:
goto swap;
}
- if (events & EPOLLHUP)
+ if (events & EPOLLHUP) {
+ flow_dbg(conn, "Hangup from %s socket",
+ pif_name(conn->f.pif[evsidei]));
goto reset;
+ }
return;
--
2.54.0
* [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
2026-05-20 13:08 ` [PATCH 1/6] tcp_splice: Improve error reporting David Gibson
@ 2026-05-20 13:08 ` David Gibson
2026-05-20 20:28 ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding David Gibson
` (3 subsequent siblings)
5 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-20 13:08 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: Paul Holzinger, David Gibson
tcp_splice_sock_handler() has an optimised path for the common case where
the amount we splice(2) into the pipe is exactly the same as the amount we
splice(2) out again. If the pipe is empty at that point, we stop
forwarding until we get another epoll event.
However, via a subtle chain of events, this can cause a bug for a
half-closed connection. Suppose the connection is already half-closed in
the other direction - that is, we've already called shutdown(SHUT_WR) on
the socket for which we're getting the event. In this event we're getting
the last batch of data in the other direction, and also a FIN. This can
result in EPOLLIN, EPOLLRDHUP and EPOLLHUP events simultaneously.
We read the last data from the socket and successfully splice it to the
other side. Since there is no data in the pipe, we exit the forwarding
loop. However, because we did read data, we don't set the eof flag.
Because we don't set eof, we don't (yet) propagate the FIN to the other
side, or set FIN_SENT(!fromsidei). Therefore we don't (yet) recognize
this as a clean termination and set the CLOSING flag. We would correct
this when we get our next event, however before we can do so we process
the EPOLLHUP event. Because we haven't recognized this as a clean close
we assume it is an abrupt close and send an RST to the other side.
To avoid this, don't stop attempting to forward data on this path.
Continue for at least one more loop. If we're at EOF, we'll recognize it
on the next splice(2). If not, it gives us an opportunity to forward more
data without returning to the main epoll loop.
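The underlying effect can be reproduced outside passt. What follows is a minimal sketch, not the passt code: it assumes Linux, uses an AF_UNIX socketpair and recv() as stand-ins for TCP sockets and splice(2), and drain_sees_eof() and demo() are hypothetical names. A reader that stops as soon as it has consumed the pending data never sees the zero-length read, while one that tries once more does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read from fd until EOF, EAGAIN or error.
 * If stop_after_data is set, give up after the first successful read,
 * mimicking the old "break" once the pipe has been emptied.
 * Returns true only if a zero-length read (EOF) was observed.
 */
static bool drain_sees_eof(int fd, bool stop_after_data)
{
	char buf[64];

	for (;;) {
		ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);

		if (n == 0)
			return true;	/* Peer's FIN: zero-length read */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return false;	/* EAGAIN or a real error */
		}
		if (stop_after_data)
			return false;	/* Data seen, EOF not checked */
	}
}

/* Hypothetical driver: the peer sends a last batch of data, then a FIN */
static bool demo(bool stop_after_data)
{
	bool eof = false;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return false;

	if (write(sv[0], "last", 4) == 4 && !shutdown(sv[0], SHUT_WR))
		eof = drain_sees_eof(sv[1], stop_after_data);

	close(sv[0]);
	close(sv[1]);
	return eof;
}
```

Calling demo(true) models the old behaviour, where the data is consumed but the EOF goes unnoticed until a later event. Calling demo(false) models the extra loop iteration, which picks up the EOF in the same pass.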
Link: https://bugs.passt.top/show_bug.cgi?id=202
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 1359d6b8..34ffea73 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -605,7 +605,7 @@ retry:
}
}
- break;
+ continue;
}
conn->read[fromsidei] += readlen > 0 ? readlen : 0;
--
2.54.0
* [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
2026-05-20 13:08 ` [PATCH 1/6] tcp_splice: Improve error reporting David Gibson
2026-05-20 13:08 ` [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding David Gibson
@ 2026-05-20 13:08 ` David Gibson
2026-05-20 20:28 ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes David Gibson
` (2 subsequent siblings)
5 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-20 13:08 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: Paul Holzinger, David Gibson
Splice forwarding can be blocked either waiting for data from one side
or waiting for space on the other. For that reason,
tcp_splice_sock_handler() on either socket can forward data in either or
both directions, depending on whether we have EPOLLIN, EPOLLOUT or both
events.
The flow control for this is quite hard to follow though, since we forward
in one direction, then sometimes loop back with a goto to do it in the
other direction. Simplify this by adding a tcp_splice_forward() function
with the logic to forward in one direction and calling it either once or
twice from tcp_splice_sock_handler().
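As a rough illustration of the resulting dispatch, here is a sketch under assumptions rather than the code in this patch: plan_forwarding() is a hypothetical helper that only records which side each forwarding pass would read from, in the order the handler would run them.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical helper: given the epoll events for the socket on side
 * evsidei, fill from[] with the side each forwarding pass reads from.
 * Returns the number of passes (0, 1 or 2).
 */
static unsigned plan_forwarding(uint32_t events, unsigned evsidei,
				unsigned from[2])
{
	unsigned n = 0;

	/* Space on this socket: forward data coming from the other side */
	if (events & EPOLLOUT)
		from[n++] = !evsidei;

	/* Data on this socket: forward it towards the other side */
	if (events & EPOLLIN)
		from[n++] = evsidei;

	return n;
}
```

With both EPOLLIN and EPOLLOUT set on side 0, this yields two passes, first from side 1 and then from side 0, without any goto back into the forwarding loop.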
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 137 ++++++++++++++++++++++++++-------------------------
1 file changed, 71 insertions(+), 66 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 34ffea73..18e8b303 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -474,67 +474,20 @@ void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0)
}
/**
- * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
+ * tcp_splice_forward() - Forward data in one direction using splice()
* @c: Execution context
- * @ref: epoll reference
- * @events: epoll events bitmap
+ * @conn: Connection to forward data for
+ * @fromsidei: Side to forward data from
*
* #syscalls:pasta splice
*/
-void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
- uint32_t events)
+static int tcp_splice_forward(struct ctx *c, struct
+ tcp_splice_conn *conn, unsigned fromsidei)
{
- struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
- unsigned evsidei = ref.flowside.sidei, fromsidei;
- uint8_t lowat_set_flag, lowat_act_flag;
- int eof, never_read;
-
- assert(conn->f.type == FLOW_TCP_SPLICE);
-
- if (conn->events == SPLICE_CLOSED)
- return;
-
- if (events & EPOLLERR) {
- int err, rc;
- socklen_t sl = sizeof(err);
-
- rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
- if (rc)
- flow_perror(conn, "Error retrieving SO_ERROR");
- else
- flow_dbg(conn, "Error event on %s socket: %s",
- pif_name(conn->f.pif[evsidei]),
- strerror_(err));
- goto reset;
- }
-
- if (conn->events == SPLICE_CONNECT) {
- if (!(events & EPOLLOUT)) {
- flow_err(conn, "Unexpected events 0x%x during connect",
- events);
- goto reset;
- }
- if (tcp_splice_connect_finish(c, conn))
- goto reset;
- }
-
- if (events & EPOLLOUT) {
- fromsidei = !evsidei;
- conn_event(conn, ~OUT_WAIT(evsidei));
- } else {
- fromsidei = evsidei;
- }
-
- if (events & EPOLLRDHUP)
- /* For side 0 this is fake, but implied */
- conn_event(conn, FIN_RCVD(evsidei));
-
-swap:
- eof = 0;
- never_read = 1;
-
- lowat_set_flag = RCVLOWAT_SET(fromsidei);
- lowat_act_flag = RCVLOWAT_ACT(fromsidei);
+ uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
+ uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
+ int never_read = 1;
+ int eof = 0;
while (1) {
ssize_t readlen, written, pending;
@@ -551,7 +504,7 @@ retry:
if (readlen < 0 && errno != EAGAIN) {
flow_perror(conn, "Splicing from %s socket",
pif_name(conn->f.pif[fromsidei]));
- goto reset;
+ return -1;
}
flow_trace(conn, "%zi from read-side call", readlen);
@@ -578,7 +531,7 @@ retry:
if (written < 0 && errno != EAGAIN) {
flow_perror(conn, "Splicing to %s socket",
pif_name(conn->f.pif[!fromsidei]));
- goto reset;
+ return -1;
}
flow_trace(conn, "%zi from write-side call (passed %zi)",
@@ -639,24 +592,76 @@ retry:
if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
flow_perror(conn, "shutdown() on %s",
pif_name(conn->f.pif[!sidei]));
- goto reset;
+ return -1;
}
conn_event(conn, FIN_SENT(!sidei));
}
}
}
- if (CONN_HAS(conn, FIN_SENT(0) | FIN_SENT(1))) {
- /* Clean close, no reset */
- conn_flag(conn, CLOSING);
+ return 0;
+}
+
+/**
+ * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
+ * @c: Execution context
+ * @ref: epoll reference
+ * @events: epoll events bitmap
+ */
+void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
+ uint32_t events)
+{
+ struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
+ unsigned evsidei = ref.flowside.sidei;
+
+ assert(conn->f.type == FLOW_TCP_SPLICE);
+
+ if (conn->events == SPLICE_CLOSED)
return;
+
+ if (events & EPOLLERR) {
+ int err, rc;
+ socklen_t sl = sizeof(err);
+
+ rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
+ if (rc)
+ flow_perror(conn, "Error retrieving SO_ERROR");
+ else
+ flow_dbg(conn, "Error event on %s socket: %s",
+ pif_name(conn->f.pif[evsidei]),
+ strerror_(err));
+ goto reset;
+ }
+
+ if (conn->events == SPLICE_CONNECT) {
+ if (!(events & EPOLLOUT)) {
+ flow_err(conn, "Unexpected events 0x%x during connect",
+ events);
+ goto reset;
+ }
+ if (tcp_splice_connect_finish(c, conn))
+ goto reset;
+ }
+
+ if (events & EPOLLRDHUP)
+ /* For side 0 this is fake, but implied */
+ conn_event(conn, FIN_RCVD(evsidei));
+
+ if (events & EPOLLOUT) {
+ if (tcp_splice_forward(c, conn, !evsidei))
+ goto reset;
+ conn_event(conn, ~OUT_WAIT(evsidei));
}
- if ((events & (EPOLLIN | EPOLLOUT)) == (EPOLLIN | EPOLLOUT)) {
- events = EPOLLIN;
+ if (events & EPOLLIN) {
+ if (tcp_splice_forward(c, conn, evsidei))
+ goto reset;
+ }
- fromsidei = !fromsidei;
- goto swap;
+ if (CONN_HAS(conn, FIN_SENT(0) | FIN_SENT(1))) {
+ /* Clean close, no reset */
+ conn_flag(conn, CLOSING);
+ return;
}
if (events & EPOLLHUP) {
--
2.54.0
* [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
` (2 preceding siblings ...)
2026-05-20 13:08 ` [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding David Gibson
@ 2026-05-20 13:08 ` David Gibson
2026-05-20 20:29 ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling David Gibson
2026-05-20 13:08 ` [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling David Gibson
5 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-20 13:08 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: Paul Holzinger, David Gibson
For each direction of each spliced connection, we keep track of how
many bytes we've read from one socket and written to the other. However,
we never actually care about the absolute values of these, only the
difference between them, which represents how much data is currently "in
flight" in the splicing pipe.
Simplify the handling by having a single variable tracking the number of
bytes in the pipe.
As a bonus, the new scheme makes it clearer that we don't need to worry
about overflows: pending can never become larger than the maximum pipe
buffer size, well within 32 bits.
I _think_ the old scheme was safe in the case of overflow - again under
the assumption that read/written can never be further apart than the pipe
buffer size. However, it's much harder to reason about this case. It's
certainly plausible that an overflow could occur - sending 4GiB through
a local socket is entirely achievable.
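The wraparound argument can be checked with a small sketch. This is an illustration under assumptions, not code from the patch: struct counters and account() are hypothetical names, and the only property relied on is that uint32_t arithmetic is modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model keeping both schemes side by side */
struct counters {
	uint32_t read;		/* Old scheme: total bytes read */
	uint32_t written;	/* Old scheme: total bytes written */
	uint32_t pending;	/* New scheme: bytes currently in the pipe */
};

/* Account for one pass: readlen bytes in, written bytes out */
static void account(struct counters *c, uint32_t readlen, uint32_t written)
{
	c->read += readlen;	/* May wrap past UINT32_MAX */
	c->written += written;	/* May wrap past UINT32_MAX */

	c->pending += readlen;
	c->pending -= written;
}

/* Old scheme: in-flight bytes as a difference, taken modulo 2^32 */
static uint32_t in_flight_old(const struct counters *c)
{
	return c->read - c->written;
}
```

Even once read has wrapped and written has not, the unsigned difference still matches pending, as long as the true difference fits in 32 bits. The single counter makes that invariant explicit instead of leaving it implied.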
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_conn.h | 6 ++----
tcp_splice.c | 18 +++++++++---------
2 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/tcp_conn.h b/tcp_conn.h
index 9f5bee03..c8381aa7 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -206,8 +206,7 @@ struct tcp_tap_transfer_ext {
* @f: Generic flow information
* @s: File descriptor for sockets
* @pipe: File descriptors for pipes
- * @read: Bytes read (not fully written to other side in one shot)
- * @written: Bytes written (not fully written from one other side read)
+ * @pending: Bytes currently in each pipe
* @events: Events observed/actions performed on connection
* @flags: Connection flags (attributes, not events)
*/
@@ -218,8 +217,7 @@ struct tcp_splice_conn {
int s[SIDES];
int pipe[SIDES][2];
- uint32_t read[SIDES];
- uint32_t written[SIDES];
+ uint32_t pending[SIDES];
uint8_t events;
#define SPLICE_CLOSED 0
diff --git a/tcp_splice.c b/tcp_splice.c
index 18e8b303..8fbd490f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -292,7 +292,7 @@ bool tcp_splice_flow_defer(struct tcp_splice_conn *conn)
conn->s[sidei] = -1;
}
- conn->read[sidei] = conn->written[sidei] = 0;
+ conn->pending[sidei] = 0;
}
conn->events = SPLICE_CLOSED;
@@ -490,7 +490,7 @@ static int tcp_splice_forward(struct ctx *c, struct
int eof = 0;
while (1) {
- ssize_t readlen, written, pending;
+ ssize_t readlen, written;
int more = 0;
retry:
@@ -537,7 +537,7 @@ retry:
flow_trace(conn, "%zi from write-side call (passed %zi)",
written, c->tcp.pipe_size);
- /* Most common case: skip updating counters. */
+ /* Most common case: skip updating pending. */
if (readlen > 0 && readlen == written) {
if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
continue;
@@ -561,11 +561,11 @@ retry:
continue;
}
- conn->read[fromsidei] += readlen > 0 ? readlen : 0;
- conn->written[fromsidei] += written > 0 ? written : 0;
+ conn->pending[fromsidei] += readlen > 0 ? readlen : 0;
+ conn->pending[fromsidei] -= written > 0 ? written : 0;
if (written < 0) {
- if (conn->read[fromsidei] == conn->written[fromsidei])
+ if (!conn->pending[fromsidei])
break;
conn_event(conn, OUT_WAIT(!fromsidei));
@@ -575,15 +575,15 @@ retry:
if (never_read && written == (long)(c->tcp.pipe_size))
goto retry;
- pending = conn->read[fromsidei] - conn->written[fromsidei];
- if (!never_read && written > 0 && written < pending)
+ if (!never_read && written > 0 &&
+ written < conn->pending[fromsidei])
goto retry;
if (eof)
break;
}
- if (conn->read[fromsidei] == conn->written[fromsidei] && eof) {
+ if (!conn->pending[fromsidei] && eof) {
unsigned sidei;
flow_foreach_sidei(sidei) {
--
2.54.0
* [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
` (3 preceding siblings ...)
2026-05-20 13:08 ` [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes David Gibson
@ 2026-05-20 13:08 ` David Gibson
2026-05-20 20:30 ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling David Gibson
5 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-20 13:08 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: Paul Holzinger, David Gibson
There are two ways we can tell one of our sockets has received a FIN. We
can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
(EOF) on the socket. We currently use both, in a mildly confusing way:
we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
some other close out logic is based on seeing an EOF.
Simplify this by setting the flag based on only the EOF. To make sure we
don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
forwarding path for EPOLLRDHUP as well as EPOLLIN.
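The two indications can be observed together in a small sketch. The assumptions are Linux, an AF_UNIX socketpair standing in for a TCP connection, and hypothetical names struct fin_probe and probe_fin(). The peer sends a FIN with no data, epoll reports EPOLLRDHUP, and the read that the event triggers returns zero.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result type: what epoll reported, and what recv() returned */
struct fin_probe {
	uint32_t events;
	ssize_t readlen;
};

/* Hypothetical probe: peer shuts down its write side without sending data */
static struct fin_probe probe_fin(void)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP };
	struct fin_probe res = { 0, -1 };
	struct epoll_event got;
	int sv[2], epfd;
	char buf[16];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return res;

	epfd = epoll_create1(0);
	if (epfd >= 0) {
		ev.data.fd = sv[1];
		if (!epoll_ctl(epfd, EPOLL_CTL_ADD, sv[1], &ev) &&
		    !shutdown(sv[0], SHUT_WR) &&
		    epoll_wait(epfd, &got, 1, 1000) == 1) {
			res.events = got.events;
			/* The read triggered by the event is what sees EOF */
			res.readlen = recv(sv[1], buf, sizeof(buf),
					   MSG_DONTWAIT);
		}
		close(epfd);
	}

	close(sv[0]);
	close(sv[1]);
	return res;
}
```

If the handler reacted to EPOLLIN alone, a FIN arriving without data would depend on the kernel also raising EPOLLIN. Treating EPOLLRDHUP as a trigger for the read path removes that dependency.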
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 8fbd490f..b45f0060 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
int never_read = 1;
- int eof = 0;
while (1) {
ssize_t readlen, written;
@@ -510,7 +509,7 @@ retry:
flow_trace(conn, "%zi from read-side call", readlen);
if (!readlen) {
- eof = 1;
+ conn_event(conn, FIN_RCVD(fromsidei));
} else if (readlen > 0) {
never_read = 0;
@@ -579,11 +578,12 @@ retry:
written < conn->pending[fromsidei])
goto retry;
- if (eof)
+ if (conn->events & FIN_RCVD(fromsidei))
break;
}
- if (!conn->pending[fromsidei] && eof) {
+ if (!conn->pending[fromsidei] &&
+ conn->events & FIN_RCVD(fromsidei)) {
unsigned sidei;
flow_foreach_sidei(sidei) {
@@ -643,17 +643,13 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
goto reset;
}
- if (events & EPOLLRDHUP)
- /* For side 0 this is fake, but implied */
- conn_event(conn, FIN_RCVD(evsidei));
-
if (events & EPOLLOUT) {
if (tcp_splice_forward(c, conn, !evsidei))
goto reset;
conn_event(conn, ~OUT_WAIT(evsidei));
}
- if (events & EPOLLIN) {
+ if (events & (EPOLLIN | EPOLLRDHUP)) {
if (tcp_splice_forward(c, conn, evsidei))
goto reset;
}
--
2.54.0
* [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
` (4 preceding siblings ...)
2026-05-20 13:08 ` [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling David Gibson
@ 2026-05-20 13:08 ` David Gibson
2026-05-20 20:30 ` Stefano Brivio
5 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-20 13:08 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: Paul Holzinger, David Gibson
At the end of tcp_splice_forward(), we check for half-closed connections
and propagate the FIN to the other side with a shutdown(2). Currently we
check for a half closed connection in either direction. That's unnecessary
here, because tcp_splice_forward() will already be called for each
direction if there are any relevant events.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index b45f0060..e5018f2e 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -582,21 +582,15 @@ retry:
break;
}
- if (!conn->pending[fromsidei] &&
- conn->events & FIN_RCVD(fromsidei)) {
- unsigned sidei;
-
- flow_foreach_sidei(sidei) {
- if ((conn->events & FIN_RCVD(sidei)) &&
- !(conn->events & FIN_SENT(!sidei))) {
- if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
- flow_perror(conn, "shutdown() on %s",
- pif_name(conn->f.pif[!sidei]));
- return -1;
- }
- conn_event(conn, FIN_SENT(!sidei));
- }
+ if ((conn->events & FIN_RCVD(fromsidei)) &&
+ !(conn->events & FIN_SENT(!fromsidei)) &&
+ !conn->pending[fromsidei]) {
+ if (shutdown(conn->s[!fromsidei], SHUT_WR) < 0) {
+ flow_perror(conn, "shutdown() on %s",
+ pif_name(conn->f.pif[!fromsidei]));
+ return -1;
}
+ conn_event(conn, FIN_SENT(!fromsidei));
}
return 0;
--
2.54.0
* Re: [PATCH 1/6] tcp_splice: Improve error reporting
2026-05-20 13:08 ` [PATCH 1/6] tcp_splice: Improve error reporting David Gibson
@ 2026-05-20 14:31 ` Stefano Brivio
2026-05-21 0:43 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-20 14:31 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger, Anshu Kumari
On Wed, 20 May 2026 23:08:46 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> A number of things can, at least theoretically, go wrong when forwarding
> data across a spliced connection. We generally handle this by resetting
> the connection on both sides. However, in many cases we don't log any
> message about why the connection was reset, which can make it hard to
> debug why this is happening.
>
> Add a bunch of debug and error logging to make this easier to figure out.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_splice.c | 31 +++++++++++++++++++++++--------
> 1 file changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 42ee8abc..1359d6b8 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -502,15 +502,18 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> if (rc)
> flow_perror(conn, "Error retrieving SO_ERROR");
> else
> - flow_trace(conn, "Error event on socket: %s",
> - strerror_(err));
> -
> + flow_dbg(conn, "Error event on %s socket: %s",
> + pif_name(conn->f.pif[evsidei]),
> + strerror_(err));
> goto reset;
> }
>
> if (conn->events == SPLICE_CONNECT) {
> - if (!(events & EPOLLOUT))
> + if (!(events & EPOLLOUT)) {
> + flow_err(conn, "Unexpected events 0x%x during connect",
> + events);
Shouldn't all the flow_err() and flow_perror() calls here be
ratelimited, that is, eventually calling the err_ratelimit() function
Anshu introduced recently?
We don't have helpers ready for flow_err() and flow_perror(). I was
about to post a patch that would go before this series, but I'm not sure
if there's a specific reason to avoid those.
> goto reset;
> + }
> if (tcp_splice_connect_finish(c, conn))
> goto reset;
> }
> @@ -545,8 +548,11 @@ retry:
> SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
> while (readlen < 0 && errno == EINTR);
>
> - if (readlen < 0 && errno != EAGAIN)
> + if (readlen < 0 && errno != EAGAIN) {
> + flow_perror(conn, "Splicing from %s socket",
> + pif_name(conn->f.pif[fromsidei]));
> goto reset;
> + }
>
> flow_trace(conn, "%zi from read-side call", readlen);
>
> @@ -569,8 +575,11 @@ retry:
> SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
> while (written < 0 && errno == EINTR);
>
> - if (written < 0 && errno != EAGAIN)
> + if (written < 0 && errno != EAGAIN) {
> + flow_perror(conn, "Splicing to %s socket",
> + pif_name(conn->f.pif[!fromsidei]));
> goto reset;
> + }
>
> flow_trace(conn, "%zi from write-side call (passed %zi)",
> written, c->tcp.pipe_size);
> @@ -627,8 +636,11 @@ retry:
> flow_foreach_sidei(sidei) {
> if ((conn->events & FIN_RCVD(sidei)) &&
> !(conn->events & FIN_SENT(!sidei))) {
> - if (shutdown(conn->s[!sidei], SHUT_WR) < 0)
> + if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> + flow_perror(conn, "shutdown() on %s",
> + pif_name(conn->f.pif[!sidei]));
> goto reset;
> + }
> conn_event(conn, FIN_SENT(!sidei));
> }
> }
> @@ -647,8 +659,11 @@ retry:
> goto swap;
> }
>
> - if (events & EPOLLHUP)
> + if (events & EPOLLHUP) {
> + flow_dbg(conn, "Hangup from %s socket",
> + pif_name(conn->f.pif[evsidei]));
> goto reset;
> + }
>
> return;
>
--
Stefano
* Re: [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding
2026-05-20 13:08 ` [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding David Gibson
@ 2026-05-20 20:28 ` Stefano Brivio
2026-05-21 0:46 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-20 20:28 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
On Wed, 20 May 2026 23:08:47 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> tcp_splice_sock_handler() has an optimised path for the common case where
> the amount we splice(2) into the pipe is exactly the same as the amount we
> splice(2) out again. If the pipe is empty at that point, we stop
> forwarding until we get another epoll event.
>
> However, via a subtle chain of events, this can cause a bug for a
> half-closed connection. Suppose the connection is already half-closed in
> the other direction - that is, we've already called shutdown(SHUT_WR) on
> the socket for which we're getting the event. In this event we're getting
> the last batch of data in the other direction, and also a FIN. This can
> result in EPOLLIN, EPOLLRDHUP and EPOLLHUP events simultaneously.
>
> We read the last data from the socket and successfully splice it to the
> other side. Since there is no data in the pipe, we exit the forwarding
> loop. However, because we did read data, we don't set the eof flag.
>
> Because we don't set eof, we don't (yet) propagate the FIN to the other
> side, or set FIN_SENT(!fromsidei). Therefore we don't (yet) recognize
> this as a clean termination and set the CLOSING flag. We would correct
> this when we get our next event, however before we can do so we process
> the EPOLLHUP event. Because we haven't recognized this as a clean close
> we assume it is an abrupt close and send an RST to the other side.
>
> To avoid this, don't stop attempting to forward data on this path.
> Continue for at least one more loop. If we're at EOF, we'll recognize it
> on the next splice(2). If not, it gives us an opportunity to forward more
> data without returning to the main epoll loop.
Oops. The fix looks correct to me, but I wonder: is it clear to you why
the issue only started occurring in this release? This code had "always"
been there.
I see a few possible directions but I'm not quite sure. Not that
important anyway, if you could reproduce the issue and this fixes it.
Just one nit:
> Link: https://bugs.passt.top/show_bug.cgi?id=202
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reported-by: Paul Holzinger <pholzing@redhat.com>
> ---
> tcp_splice.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 1359d6b8..34ffea73 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -605,7 +605,7 @@ retry:
> }
> }
>
> - break;
> + continue;
> }
>
> conn->read[fromsidei] += readlen > 0 ? readlen : 0;
--
Stefano
* Re: [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding
2026-05-20 13:08 ` [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding David Gibson
@ 2026-05-20 20:28 ` Stefano Brivio
2026-05-21 0:50 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-20 20:28 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
Ah, yes, it looks better now. Three remarks:
On Wed, 20 May 2026 23:08:48 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> Splice forwarding can be blocked either waiting for data from one side
> or waiting for space on the other. For that reason,
> tcp_splice_sock_handler() on either socket can forward data in either or
> both directions, depending on whether we have EPOLLIN, EPOLLOUT or both
> events.
>
> The flow control for this is quite hard to follow though, since we forward
> in one direction, then sometimes loop back with a goto to do it in the
> other direction. Simplify this by adding a tcp_splice_forward() function
> with the logic to forward in one direction and calling it either once or
> twice from tcp_splice_sock_handler().
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_splice.c | 137 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 71 insertions(+), 66 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 34ffea73..18e8b303 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -474,67 +474,20 @@ void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0)
> }
>
> /**
> - * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
> + * tcp_splice_forward() - Forward data in one direction using splice()
> * @c: Execution context
> - * @ref: epoll reference
> - * @events: epoll events bitmap
> + * @conn: Connection to forward data for
> + * @fromsidei: Side to forward data from
> *
> * #syscalls:pasta splice
> */
> -void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> - uint32_t events)
> +static int tcp_splice_forward(struct ctx *c, struct
> + tcp_splice_conn *conn, unsigned fromsidei)
I think the struct argument should all be on the same line.
> {
> - struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
> - unsigned evsidei = ref.flowside.sidei, fromsidei;
> - uint8_t lowat_set_flag, lowat_act_flag;
> - int eof, never_read;
> -
> - assert(conn->f.type == FLOW_TCP_SPLICE);
> -
> - if (conn->events == SPLICE_CLOSED)
> - return;
> -
> - if (events & EPOLLERR) {
> - int err, rc;
> - socklen_t sl = sizeof(err);
> -
> - rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
> - if (rc)
> - flow_perror(conn, "Error retrieving SO_ERROR");
> - else
> - flow_dbg(conn, "Error event on %s socket: %s",
> - pif_name(conn->f.pif[evsidei]),
> - strerror_(err));
> - goto reset;
> - }
> -
> - if (conn->events == SPLICE_CONNECT) {
> - if (!(events & EPOLLOUT)) {
> - flow_err(conn, "Unexpected events 0x%x during connect",
> - events);
> - goto reset;
> - }
> - if (tcp_splice_connect_finish(c, conn))
> - goto reset;
> - }
> -
> - if (events & EPOLLOUT) {
> - fromsidei = !evsidei;
> - conn_event(conn, ~OUT_WAIT(evsidei));
> - } else {
> - fromsidei = evsidei;
> - }
> -
> - if (events & EPOLLRDHUP)
> - /* For side 0 this is fake, but implied */
> - conn_event(conn, FIN_RCVD(evsidei));
> -
> -swap:
> - eof = 0;
> - never_read = 1;
> -
> - lowat_set_flag = RCVLOWAT_SET(fromsidei);
> - lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> + uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> + uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> + int never_read = 1;
> + int eof = 0;
>
> while (1) {
> ssize_t readlen, written, pending;
> @@ -551,7 +504,7 @@ retry:
> if (readlen < 0 && errno != EAGAIN) {
> flow_perror(conn, "Splicing from %s socket",
> pif_name(conn->f.pif[fromsidei]));
> - goto reset;
> + return -1;
> }
>
> flow_trace(conn, "%zi from read-side call", readlen);
> @@ -578,7 +531,7 @@ retry:
> if (written < 0 && errno != EAGAIN) {
> flow_perror(conn, "Splicing to %s socket",
> pif_name(conn->f.pif[!fromsidei]));
> - goto reset;
> + return -1;
> }
>
> flow_trace(conn, "%zi from write-side call (passed %zi)",
> @@ -639,24 +592,76 @@ retry:
> if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> flow_perror(conn, "shutdown() on %s",
> pif_name(conn->f.pif[!sidei]));
> - goto reset;
> + return -1;
> }
> conn_event(conn, FIN_SENT(!sidei));
> }
> }
> }
>
> - if (CONN_HAS(conn, FIN_SENT(0) | FIN_SENT(1))) {
> - /* Clean close, no reset */
> - conn_flag(conn, CLOSING);
> + return 0;
> +}
> +
> +/**
> + * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
> + * @c: Execution context
> + * @ref: epoll reference
> + * @events: epoll events bitmap
> + */
> +void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> + uint32_t events)
> +{
> + struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
> + unsigned evsidei = ref.flowside.sidei;
> +
> + assert(conn->f.type == FLOW_TCP_SPLICE);
> +
> + if (conn->events == SPLICE_CLOSED)
> return;
> +
> + if (events & EPOLLERR) {
> + int err, rc;
> + socklen_t sl = sizeof(err);
> +
> + rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
> + if (rc)
> + flow_perror(conn, "Error retrieving SO_ERROR");
> + else
> + flow_dbg(conn, "Error event on %s socket: %s",
> + pif_name(conn->f.pif[evsidei]),
> + strerror_(err));
> + goto reset;
> + }
> +
> + if (conn->events == SPLICE_CONNECT) {
> + if (!(events & EPOLLOUT)) {
> + flow_err(conn, "Unexpected events 0x%x during connect",
> + events);
> + goto reset;
> + }
> + if (tcp_splice_connect_finish(c, conn))
> + goto reset;
> + }
> +
> + if (events & EPOLLRDHUP)
> + /* For side 0 this is fake, but implied */
> + conn_event(conn, FIN_RCVD(evsidei));
I saw this all goes away in 5/6, so it wouldn't be relevant. But in
case we decide to drop 5/6, here are my remarks on this.
EPOLLRDHUP is now handled before checking the other direction of the
connection in case of EPOLLOUT.
I think it actually makes more sense this way because we update flags
with everything we know until that point, and it shouldn't have a
functional effect (the check at the end of the new tcp_splice_forward()
is on FIN_RCVD(fromsidei)), but I'm raising that in case the change
wasn't intended.
> +
> + if (events & EPOLLOUT) {
> + if (tcp_splice_forward(c, conn, !evsidei))
> + goto reset;
> + conn_event(conn, ~OUT_WAIT(evsidei));
> }
>
> - if ((events & (EPOLLIN | EPOLLOUT)) == (EPOLLIN | EPOLLOUT)) {
> - events = EPOLLIN;
> + if (events & EPOLLIN) {
> + if (tcp_splice_forward(c, conn, evsidei))
> + goto reset;
This should be:
goto reset;
instead of:
goto reset;
> + }
>
> - fromsidei = !fromsidei;
> - goto swap;
> + if (CONN_HAS(conn, FIN_SENT(0) | FIN_SENT(1))) {
> + /* Clean close, no reset */
> + conn_flag(conn, CLOSING);
> + return;
> }
>
> if (events & EPOLLHUP) {
--
Stefano
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes
2026-05-20 13:08 ` [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes David Gibson
@ 2026-05-20 20:29 ` Stefano Brivio
2026-05-21 0:54 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-20 20:29 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
On Wed, 20 May 2026 23:08:49 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> For each direction of each spliced connection, we keep track of how
> many bytes we've read from one socket and written to the other. However,
> we never actually care about the absolute values of these, only the
> difference between them, which represents how much data is currently "in
> flight" in the splicing pipe.
>
> Simplify the handling by having a single variable tracking the number of
> bytes in the pipe.
For me it actually looks slightly more complicated to think about it
this way, I added explicit 'read' and 'written' after being bitten by
some issue I introduced with a previous 'pending' concept, but I have
to admit it slightly simplifies the overflow topic.
> As a bonus, the new scheme makes it clearer that we don't need to worry
> about overflows: pending can never become larger than the maximum pipe
> buffer size, well within 32 bits.
>
> I _think_ the old scheme was safe in the case of overflow - again under
> the assumption that read/written can never be further apart than the pipe
> buffer size. However, it's much harder to reason about this case. It's
> certainly plausible that an overflow could occur - sending 4GiB through
> a local socket is entirely achievable.
For me it looked pretty simple: you can overflow 32 bits (at 100 Gbps,
but without hitting the "optimised" case, it would take about five
minutes), but all the operations between the two counters are between
two uint32_t, so they would happen in uint32_t, hence modulo 32 bits,
similar to TCP sequences.
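To make the wrap-around argument concrete, here is a minimal sketch comparing the two accounting schemes. The struct and function names are made up for illustration and are not taken from tcp_splice.c; the only assumption is the one stated in the thread, that the counters never drift apart by more than the pipe size.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme as described in the thread: two free-running counters.
 * Names here are illustrative, not passt's.
 */
struct two_counters {
	uint32_t read;
	uint32_t written;
};

/* New scheme: a single count of bytes currently sitting in the pipe */
struct one_counter {
	uint32_t pending;
};

/* Bytes in flight under the old scheme. Both operands are uint32_t, so
 * the result is reduced modulo 2^32 and stays correct across a wrap.
 */
static uint32_t in_flight(const struct two_counters *c)
{
	return c->read - c->written;
}

/* Account for one read-side and one write-side result in both schemes */
static void account(struct two_counters *o, struct one_counter *n,
		    uint32_t readlen, uint32_t written)
{
	o->read += readlen;
	o->written += written;

	n->pending += readlen;
	n->pending -= written;
}

/* Start just below the wrap point and check that both schemes agree.
 * Returns 1 if they do, 0 otherwise.
 */
static int schemes_agree_across_wrap(void)
{
	struct two_counters o = { UINT32_MAX - 10000, UINT32_MAX - 10000 };
	struct one_counter n = { 0 };

	/* 'read' wraps past zero here, 'written' does not */
	account(&o, &n, 65536, 4096);
	if (in_flight(&o) != 61440 || n.pending != 61440)
		return 0;

	/* Drain the rest: 'written' wraps too and catches up with 'read' */
	account(&o, &n, 0, 61440);
	return in_flight(&o) == 0 && n.pending == 0 && o.read == o.written;
}
```

Both schemes give the same answer across the wrap. The difference is only in how much reasoning is needed: the single counter is bounded by the pipe size by construction, while the pair relies on modular subtraction.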
Anyway, overall, I think it's an improvement over the original. One nit
here:
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_conn.h | 6 ++----
> tcp_splice.c | 18 +++++++++---------
> 2 files changed, 11 insertions(+), 13 deletions(-)
>
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 9f5bee03..c8381aa7 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -206,8 +206,7 @@ struct tcp_tap_transfer_ext {
> * @f: Generic flow information
> * @s: File descriptor for sockets
> * @pipe: File descriptors for pipes
> - * @read: Bytes read (not fully written to other side in one shot)
> - * @written: Bytes written (not fully written from one other side read)
> + * @pending: Bytes currently in each pipe
> * @events: Events observed/actions performed on connection
> * @flags: Connection flags (attributes, not events)
> */
> @@ -218,8 +217,7 @@ struct tcp_splice_conn {
> int s[SIDES];
> int pipe[SIDES][2];
>
> - uint32_t read[SIDES];
> - uint32_t written[SIDES];
> + uint32_t pending[SIDES];
>
> uint8_t events;
> #define SPLICE_CLOSED 0
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 18e8b303..8fbd490f 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -292,7 +292,7 @@ bool tcp_splice_flow_defer(struct tcp_splice_conn *conn)
> conn->s[sidei] = -1;
> }
>
> - conn->read[sidei] = conn->written[sidei] = 0;
> + conn->pending[sidei] = 0;
> }
>
> conn->events = SPLICE_CLOSED;
> @@ -490,7 +490,7 @@ static int tcp_splice_forward(struct ctx *c, struct
> int eof = 0;
>
> while (1) {
> - ssize_t readlen, written, pending;
> + ssize_t readlen, written;
> int more = 0;
>
> retry:
> @@ -537,7 +537,7 @@ retry:
> flow_trace(conn, "%zi from write-side call (passed %zi)",
> written, c->tcp.pipe_size);
>
> - /* Most common case: skip updating counters. */
> + /* Most common case: skip updating pending. */
"pending" isn't a noun (even though the variable name is, but it's
not quite obvious that you're referring to it). I think that:
/* Most common case: skip updating count of pending bytes */
would be slightly clearer (and also omit the '.' because it's not a
complete sentence, as we usually do on single-line comments, similarly
to most occurrences in the kernel).
> if (readlen > 0 && readlen == written) {
> if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
> continue;
> @@ -561,11 +561,11 @@ retry:
> continue;
> }
>
> - conn->read[fromsidei] += readlen > 0 ? readlen : 0;
> - conn->written[fromsidei] += written > 0 ? written : 0;
> + conn->pending[fromsidei] += readlen > 0 ? readlen : 0;
> + conn->pending[fromsidei] -= written > 0 ? written : 0;
>
> if (written < 0) {
> - if (conn->read[fromsidei] == conn->written[fromsidei])
> + if (!conn->pending[fromsidei])
> break;
>
> conn_event(conn, OUT_WAIT(!fromsidei));
> @@ -575,15 +575,15 @@ retry:
> if (never_read && written == (long)(c->tcp.pipe_size))
> goto retry;
>
> - pending = conn->read[fromsidei] - conn->written[fromsidei];
> - if (!never_read && written > 0 && written < pending)
> + if (!never_read && written > 0 &&
> + written < conn->pending[fromsidei])
> goto retry;
>
> if (eof)
> break;
> }
>
> - if (conn->read[fromsidei] == conn->written[fromsidei] && eof) {
> + if (!conn->pending[fromsidei] && eof) {
> unsigned sidei;
>
> flow_foreach_sidei(sidei) {
--
Stefano
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-20 13:08 ` [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling David Gibson
@ 2026-05-20 20:30 ` Stefano Brivio
2026-05-21 2:03 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-20 20:30 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
On Wed, 20 May 2026 23:08:50 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> There are two ways we can tell one of our sockets has received a FIN. We
> can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> (EOF) on the socket. We currently use both, in a mildly confusing way:
> we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> some other close out logic is based on seeing an EOF.
>
> Simplify this by setting the flag based on only the EOF. To make sure we
> don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> forwarding path for EPOLLRDHUP as well as EPOLLIN.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_splice.c | 14 +++++---------
> 1 file changed, 5 insertions(+), 9 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 8fbd490f..b45f0060 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> int never_read = 1;
> - int eof = 0;
>
> while (1) {
> ssize_t readlen, written;
> @@ -510,7 +509,7 @@ retry:
> flow_trace(conn, "%zi from read-side call", readlen);
>
> if (!readlen) {
> - eof = 1;
> + conn_event(conn, FIN_RCVD(fromsidei));
I'm not sure if I really found a concrete issue with this, but it looks
a bit scary, because it changes the semantics of FIN_RCVD, which used to
mean that we infer we received a FIN, regardless of whether we're done
processing all data from that half of the connection.
Now FIN_RCVD is only set if we actually processed all the data and we
hit the end of file.
The (potential) issue I see here is that we get EPOLLRDHUP, and splice()
returns -1 with EAGAIN in errno because we had no room in the pipe,
where it would otherwise have returned 0.
Will we ever get our zero-sized "read" later? If not, we might have
missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
guarantees in that sense from splice().
The existing implementation distinguishes between end-of-file we hit in
a given iteration, and EPOLLRDHUP we might have seen at any time.
That was actually intended.
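The difference between the two indications can be observed in isolation. The sketch below is only an illustration under stated assumptions: it uses an AF_UNIX stream socketpair as a stand-in for a TCP connection and plain read() instead of splice(), so it says nothing about the EAGAIN case discussed above. The function name and the three flag values are hypothetical.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define RDHUP_BEFORE_DRAIN	1	/* EPOLLRDHUP seen while data was unread */
#define EOF_ON_FIRST_READ	2	/* first read() already returned 0 */
#define EOF_AFTER_DRAIN		4	/* read() returned 0 after draining data */

/* Returns a bitmask of the flags above, or -1 if setup failed */
static int fin_indications(void)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP };
	struct epoll_event out;
	int sv[2], epfd, ret = -1;
	char buf[16];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	epfd = epoll_create1(0);
	if (epfd < 0)
		goto out_socks;

	ev.data.fd = sv[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
		goto out;

	/* The peer queues its last data, then half-closes (the "FIN") */
	if (write(sv[1], "data", 4) != 4 || shutdown(sv[1], SHUT_WR) < 0)
		goto out;

	if (epoll_wait(epfd, &out, 1, 1000) != 1)
		goto out;

	ret = 0;
	if (out.events & EPOLLRDHUP)
		ret |= RDHUP_BEFORE_DRAIN;

	/* First read drains the queued data */
	if (read(sv[0], buf, sizeof(buf)) == 0)
		ret |= EOF_ON_FIRST_READ;

	/* Only now does the zero-length read show up */
	if (read(sv[0], buf, sizeof(buf)) == 0)
		ret |= EOF_AFTER_DRAIN;

out:
	close(epfd);
out_socks:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

On Linux I would expect RDHUP_BEFORE_DRAIN and EOF_AFTER_DRAIN to be set: the event arrives together with the data, and the zero-length read comes only after draining it. A flag set from the event means "FIN inferred", while one set from the read means "FIN reached".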
--
Stefano
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling
2026-05-20 13:08 ` [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling David Gibson
@ 2026-05-20 20:30 ` Stefano Brivio
2026-05-21 2:11 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-20 20:30 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
On Wed, 20 May 2026 23:08:51 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> At the end of tcp_splice_forward(), we check for half-closed connections
> and propagate the FIN to the other side with a shutdown(2). Currently we
> check for a half closed connection in either direction. That's unnecessary
> here, because tcp_splice_forward() will already be called for each
> direction if there are any relevant events.
True, but do we have the guarantee that tcp_splice_forward() will also
be called once all relevant FIN_RCVD / FIN_SENT flags have been set?
The reason why we check both sides here is that we might have updated
flags for one side, and now we need to double check if it's time to
call shutdown() as a consequence.
Maybe we never have to, but I think it's not really obvious to prove.
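To illustrate the concern, here is a sketch of the two checks as pure functions, so that they can be compared on a given connection state. The flag layout, the struct and the function names are hypothetical and chosen for this example only; the conditions are transcribed from the hunk quoted below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag layout, for illustration only (not passt's values) */
#define FIN_RCVD(sidei)	(1u << (sidei))
#define FIN_SENT(sidei)	(1u << (2 + (sidei)))

struct conn_sketch {
	uint8_t events;
	uint32_t pending[2];
};

/* Old check: once the forwarding direction is drained and has seen a
 * FIN, look at both sides. Returns a bitmask of the socket indices
 * that would get shutdown(SHUT_WR).
 */
static unsigned old_fin_targets(const struct conn_sketch *conn,
				unsigned fromsidei)
{
	unsigned sidei, mask = 0;

	if (conn->pending[fromsidei] ||
	    !(conn->events & FIN_RCVD(fromsidei)))
		return 0;

	for (sidei = 0; sidei < 2; sidei++) {
		if ((conn->events & FIN_RCVD(sidei)) &&
		    !(conn->events & FIN_SENT(!sidei)))
			mask |= 1u << !sidei;
	}

	return mask;
}

/* New check: only the direction we were just called for */
static unsigned new_fin_targets(const struct conn_sketch *conn,
				unsigned fromsidei)
{
	if ((conn->events & FIN_RCVD(fromsidei)) &&
	    !(conn->events & FIN_SENT(!fromsidei)) &&
	    !conn->pending[fromsidei])
		return 1u << !fromsidei;

	return 0;
}
```

With FIN_RCVD set for both sides and nothing pending, a single call to the old check for side 0 selects both sockets, while the new check selects only socket 1. The other socket is covered only if the forwarding function is also called for side 1, which is the guarantee I'm asking about.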
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_splice.c | 22 ++++++++--------------
> 1 file changed, 8 insertions(+), 14 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index b45f0060..e5018f2e 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -582,21 +582,15 @@ retry:
> break;
> }
>
> - if (!conn->pending[fromsidei] &&
> - conn->events & FIN_RCVD(fromsidei)) {
> - unsigned sidei;
> -
> - flow_foreach_sidei(sidei) {
> - if ((conn->events & FIN_RCVD(sidei)) &&
> - !(conn->events & FIN_SENT(!sidei))) {
> - if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> - flow_perror(conn, "shutdown() on %s",
> - pif_name(conn->f.pif[!sidei]));
> - return -1;
> - }
> - conn_event(conn, FIN_SENT(!sidei));
> - }
> + if ((conn->events & FIN_RCVD(fromsidei)) &&
> + !(conn->events & FIN_SENT(!fromsidei)) &&
> + !conn->pending[fromsidei]) {
> + if (shutdown(conn->s[!fromsidei], SHUT_WR) < 0) {
> + flow_perror(conn, "shutdown() on %s",
> + pif_name(conn->f.pif[!fromsidei]));
> + return -1;
> }
> + conn_event(conn, FIN_SENT(!fromsidei));
> }
>
> return 0;
--
Stefano
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/6] tcp_splice: Improve error reporting
2026-05-20 14:31 ` Stefano Brivio
@ 2026-05-21 0:43 ` David Gibson
2026-05-21 5:08 ` Stefano Brivio
0 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-21 0:43 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger, Anshu Kumari
On Wed, May 20, 2026 at 04:31:34PM +0200, Stefano Brivio wrote:
> On Wed, 20 May 2026 23:08:46 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > A number of things can, at least theoretically, go wrong when forwarding
> > data across a spliced connection. We generally handle this by resetting
> > the connection on both sides. However, in many cases we don't log any
> > message about why the connection was reset, which can make it hard to
> > debug why this is happening.
> >
> > Add a bunch of debug and error logging to make this easier to figure out.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_splice.c | 31 +++++++++++++++++++++++--------
> > 1 file changed, 23 insertions(+), 8 deletions(-)
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 42ee8abc..1359d6b8 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -502,15 +502,18 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> > if (rc)
> > flow_perror(conn, "Error retrieving SO_ERROR");
> > else
> > - flow_trace(conn, "Error event on socket: %s",
> > - strerror_(err));
> > -
> > + flow_dbg(conn, "Error event on %s socket: %s",
> > + pif_name(conn->f.pif[evsidei]),
> > + strerror_(err));
> > goto reset;
> > }
> >
> > if (conn->events == SPLICE_CONNECT) {
> > - if (!(events & EPOLLOUT))
> > + if (!(events & EPOLLOUT)) {
> > + flow_err(conn, "Unexpected events 0x%x during connect",
> > + events);
>
> Shouldn't all the flow_err() and flow_perror() calls here be
> ratelimited, that is, eventually calling the err_ratelimit() function
> Anshu introduced recently?
I did think about that, but concluded it wasn't necessary here: these
messages indicate that something has gone unexpectedly wrong at the
kernel level, and they aren't guest-triggerable.
I can put in ratelimits if you still think they're necessary.
> We don't have helpers ready for flow_err() and flow_perror(), I was
> about to post a patch that would go before this series but I'm not sure
> if there's a specific reason to avoid those.
>
> > goto reset;
> > + }
> > if (tcp_splice_connect_finish(c, conn))
> > goto reset;
> > }
> > @@ -545,8 +548,11 @@ retry:
> > SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
> > while (readlen < 0 && errno == EINTR);
> >
> > - if (readlen < 0 && errno != EAGAIN)
> > + if (readlen < 0 && errno != EAGAIN) {
> > + flow_perror(conn, "Splicing from %s socket",
> > + pif_name(conn->f.pif[fromsidei]));
> > goto reset;
> > + }
> >
> > flow_trace(conn, "%zi from read-side call", readlen);
> >
> > @@ -569,8 +575,11 @@ retry:
> > SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
> > while (written < 0 && errno == EINTR);
> >
> > - if (written < 0 && errno != EAGAIN)
> > + if (written < 0 && errno != EAGAIN) {
> > + flow_perror(conn, "Splicing to %s socket",
> > + pif_name(conn->f.pif[!fromsidei]));
> > goto reset;
> > + }
> >
> > flow_trace(conn, "%zi from write-side call (passed %zi)",
> > written, c->tcp.pipe_size);
> > @@ -627,8 +636,11 @@ retry:
> > flow_foreach_sidei(sidei) {
> > if ((conn->events & FIN_RCVD(sidei)) &&
> > !(conn->events & FIN_SENT(!sidei))) {
> > - if (shutdown(conn->s[!sidei], SHUT_WR) < 0)
> > + if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> > + flow_perror(conn, "shutdown() on %s",
> > + pif_name(conn->f.pif[!sidei]));
> > goto reset;
> > + }
> > conn_event(conn, FIN_SENT(!sidei));
> > }
> > }
> > @@ -647,8 +659,11 @@ retry:
> > goto swap;
> > }
> >
> > - if (events & EPOLLHUP)
> > + if (events & EPOLLHUP) {
> > + flow_dbg(conn, "Hangup from %s socket",
> > + pif_name(conn->f.pif[evsidei]));
Except for this one, which is debug level for that reason.
> > goto reset;
> > + }
> >
> > return;
> >
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding
2026-05-20 20:28 ` Stefano Brivio
@ 2026-05-21 0:46 ` David Gibson
0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2026-05-21 0:46 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger
On Wed, May 20, 2026 at 10:28:36PM +0200, Stefano Brivio wrote:
> On Wed, 20 May 2026 23:08:47 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > tcp_splice_sock_handler() has an optimised path for the common case where
> > the amount we splice(2) into the pipe is exactly the same as the amount we
> > splice(2) out again. If the pipe is empty at that point, we stop
> > forwarding until we get another epoll event.
> >
> > However, via a subtle chain of events, this can cause a bug for a
> > half-closed connection. Suppose the connection is already half-closed in
> > the other direction - that is, we've already called shutdown(SHUT_WR) on
> > the socket for which we're getting the event. In this event we're getting
> > the last batch of data in the other direction, and also a FIN. This can
> > result in EPOLLIN, EPOLLRDHUP and EPOLLHUP events simultaneously.
> >
> > We read the last data from the socket and successfully splice it to the
> > other side. Since there is no data in the pipe, we exit the forwarding
> > loop. However, because we did read data, we don't set the eof flag.
> >
> > Because we don't set eof, we don't (yet) propagate the FIN to the other
> > side, or set FIN_SENT(!fromsidei). Therefore we don't (yet) recognize
> > this as a clean termination and set the CLOSING flag. We would correct
> > this when we get our next event, however before we can do so we process
> > the EPOLLHUP event. Because we haven't recognized this as a clean close
> > we assume it is an abrupt close and send an RST to the other side.
> >
> > To avoid this, don't stop attempting to forward data on this path.
> > Continue for at least one more loop. If we're at EOF, we'll recognize it
> > on the next splice(2). If not it gives us an opportunity to forward more
> > data without returning to the main epoll loop.
>
> Oops. The fix looks correct to me, but I wonder: is it clear to you why
> the issue only started occurring in this release? This code had "always"
> been there.
Because we didn't use to force resets on abnormal connection
terminations, it still worked by accident.
> I see a few possible directions but I'm not quite sure. Not that
> important anyway, if you could reproduce the issue and this fixes it.
Ah, actually, I do still need to test with the original reproducer.
It fixes it for my reproducer which I'm maybe 90% confident is
exercising the same bug.
> Just one nit:
>
> > Link: https://bugs.passt.top/show_bug.cgi?id=202
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>
> Reported-by: Paul Holzinger <pholzing@redhat.com>
Good point, fixed.
>
> > ---
> > tcp_splice.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 1359d6b8..34ffea73 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -605,7 +605,7 @@ retry:
> > }
> > }
> >
> > - break;
> > + continue;
> > }
> >
> > conn->read[fromsidei] += readlen > 0 ? readlen : 0;
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding
2026-05-20 20:28 ` Stefano Brivio
@ 2026-05-21 0:50 ` David Gibson
0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2026-05-21 0:50 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger
On Wed, May 20, 2026 at 10:28:52PM +0200, Stefano Brivio wrote:
> Ah, yes, it looks better now. Three remarks:
>
> On Wed, 20 May 2026 23:08:48 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Splice forwarding can be blocked either waiting for data from one side
> > or waiting for space on the other. For that reason,
> > tcp_splice_sock_handler() on either socket can forward data in either or
> > both directions, depending on whether we have EPOLLIN, EPOLLOUT or both
> > events.
> >
> > The flow control for this is quite hard to follow though, since we forward
> > in one direction, then sometimes loop back with a goto to do it in the
> > other direction. Simplify this by adding a tcp_splice_forward() function
> > with the logic to forward in one direction and calling it either once or
> > twice from tcp_splice_sock_handler().
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_splice.c | 137 ++++++++++++++++++++++++++-------------------------
> > 1 file changed, 71 insertions(+), 66 deletions(-)
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 34ffea73..18e8b303 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -474,67 +474,20 @@ void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0)
> > }
> >
> > /**
> > - * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
> > + * tcp_splice_forward() - Forward data in one direction using splice()
> > * @c: Execution context
> > - * @ref: epoll reference
> > - * @events: epoll events bitmap
> > + * @conn: Connection to forward data for
> > + * @fromsidei: Side to forward data from
> > *
> > * #syscalls:pasta splice
> > */
> > -void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> > - uint32_t events)
> > +static int tcp_splice_forward(struct ctx *c, struct
> > + tcp_splice_conn *conn, unsigned fromsidei)
>
> I think the struct
> argument should all be on the same line.
Oops, definitely. Forgot to document the return value too.
> > {
> > - struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
> > - unsigned evsidei = ref.flowside.sidei, fromsidei;
> > - uint8_t lowat_set_flag, lowat_act_flag;
> > - int eof, never_read;
> > -
> > - assert(conn->f.type == FLOW_TCP_SPLICE);
> > -
> > - if (conn->events == SPLICE_CLOSED)
> > - return;
> > -
> > - if (events & EPOLLERR) {
> > - int err, rc;
> > - socklen_t sl = sizeof(err);
> > -
> > - rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
> > - if (rc)
> > - flow_perror(conn, "Error retrieving SO_ERROR");
> > - else
> > - flow_dbg(conn, "Error event on %s socket: %s",
> > - pif_name(conn->f.pif[evsidei]),
> > - strerror_(err));
> > - goto reset;
> > - }
> > -
> > - if (conn->events == SPLICE_CONNECT) {
> > - if (!(events & EPOLLOUT)) {
> > - flow_err(conn, "Unexpected events 0x%x during connect",
> > - events);
> > - goto reset;
> > - }
> > - if (tcp_splice_connect_finish(c, conn))
> > - goto reset;
> > - }
> > -
> > - if (events & EPOLLOUT) {
> > - fromsidei = !evsidei;
> > - conn_event(conn, ~OUT_WAIT(evsidei));
> > - } else {
> > - fromsidei = evsidei;
> > - }
> > -
> > - if (events & EPOLLRDHUP)
> > - /* For side 0 this is fake, but implied */
> > - conn_event(conn, FIN_RCVD(evsidei));
> > -
> > -swap:
> > - eof = 0;
> > - never_read = 1;
> > -
> > - lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > - lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > + uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > + uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > + int never_read = 1;
> > + int eof = 0;
> >
> > while (1) {
> > ssize_t readlen, written, pending;
> > @@ -551,7 +504,7 @@ retry:
> > if (readlen < 0 && errno != EAGAIN) {
> > flow_perror(conn, "Splicing from %s socket",
> > pif_name(conn->f.pif[fromsidei]));
> > - goto reset;
> > + return -1;
> > }
> >
> > flow_trace(conn, "%zi from read-side call", readlen);
> > @@ -578,7 +531,7 @@ retry:
> > if (written < 0 && errno != EAGAIN) {
> > flow_perror(conn, "Splicing to %s socket",
> > pif_name(conn->f.pif[!fromsidei]));
> > - goto reset;
> > + return -1;
> > }
> >
> > flow_trace(conn, "%zi from write-side call (passed %zi)",
> > @@ -639,24 +592,76 @@ retry:
> > if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> > flow_perror(conn, "shutdown() on %s",
> > pif_name(conn->f.pif[!sidei]));
> > - goto reset;
> > + return -1;
> > }
> > conn_event(conn, FIN_SENT(!sidei));
> > }
> > }
> > }
> >
> > - if (CONN_HAS(conn, FIN_SENT(0) | FIN_SENT(1))) {
> > - /* Clean close, no reset */
> > - conn_flag(conn, CLOSING);
> > + return 0;
> > +}
> > +
> > +/**
> > + * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
> > + * @c: Execution context
> > + * @ref: epoll reference
> > + * @events: epoll events bitmap
> > + */
> > +void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> > + uint32_t events)
> > +{
> > + struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
> > + unsigned evsidei = ref.flowside.sidei;
> > +
> > + assert(conn->f.type == FLOW_TCP_SPLICE);
> > +
> > + if (conn->events == SPLICE_CLOSED)
> > return;
> > +
> > + if (events & EPOLLERR) {
> > + int err, rc;
> > + socklen_t sl = sizeof(err);
> > +
> > + rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
> > + if (rc)
> > + flow_perror(conn, "Error retrieving SO_ERROR");
> > + else
> > + flow_dbg(conn, "Error event on %s socket: %s",
> > + pif_name(conn->f.pif[evsidei]),
> > + strerror_(err));
> > + goto reset;
> > + }
> > +
> > + if (conn->events == SPLICE_CONNECT) {
> > + if (!(events & EPOLLOUT)) {
> > + flow_err(conn, "Unexpected events 0x%x during connect",
> > + events);
> > + goto reset;
> > + }
> > + if (tcp_splice_connect_finish(c, conn))
> > + goto reset;
> > + }
> > +
> > + if (events & EPOLLRDHUP)
> > + /* For side 0 this is fake, but implied */
> > + conn_event(conn, FIN_RCVD(evsidei));
>
> I saw this all goes away in 5/6, so it wouldn't be relevant. But in
> > case we decide to drop 5/6, here are my remarks on this.
>
> EPOLLRDHUP is now handled before checking the other direction of the
> connection in case of EPOLLOUT.
I'm pretty sure that hasn't changed. In the old code EPOLLRDHUP
handling was before we did any of the actual data handling for EPOLLIN
or EPOLLOUT.
> I think it actually makes more sense this way because we update flags
> with everything we know until that point, and it shouldn't have a
> functional effect (the check at the end of the new tcp_splice_forward()
> is on FIN_RCVD(fromsidei)), but I'm raising that in case the change
> wasn't intended.
>
> > +
> > + if (events & EPOLLOUT) {
> > + if (tcp_splice_forward(c, conn, !evsidei))
> > + goto reset;
> > + conn_event(conn, ~OUT_WAIT(evsidei));
> > }
> >
> > - if ((events & (EPOLLIN | EPOLLOUT)) == (EPOLLIN | EPOLLOUT)) {
> > - events = EPOLLIN;
> > + if (events & EPOLLIN) {
> > + if (tcp_splice_forward(c, conn, evsidei))
> > + goto reset;
>
> This should be:
>
> goto reset;
>
> instead of:
>
> goto reset;
Oops, fixed.
>
> > + }
> >
> > - fromsidei = !fromsidei;
> > - goto swap;
> > + if (CONN_HAS(conn, FIN_SENT(0) | FIN_SENT(1))) {
> > + /* Clean close, no reset */
> > + conn_flag(conn, CLOSING);
> > + return;
> > }
> >
> > if (events & EPOLLHUP) {
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes
2026-05-20 20:29 ` Stefano Brivio
@ 2026-05-21 0:54 ` David Gibson
0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2026-05-21 0:54 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger
On Wed, May 20, 2026 at 10:29:12PM +0200, Stefano Brivio wrote:
> On Wed, 20 May 2026 23:08:49 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > For each direction of each spliced connection, we keep track of how
> > many bytes we've read from one socket and written to the other. However,
> > we never actually care about the absolute values of these, only the
> > difference between them, which represents how much data is currently "in
> > flight" in the splicing pipe.
> >
> > Simplify the handling by having a single variable tracking the number of
> > bytes in the pipe.
>
> For me it actually looks slightly more complicated to think about it
> this way, I added explicit 'read' and 'written' after being bitten by
> some issue I introduced with a previous 'pending' concept, but I have
> to admit it slightly simplifies the overflow topic.
>
> > As a bonus, the new scheme makes it clearer that we don't need to worry
> > about overflows: pending can never become larger than the maximum pipe
> > buffer size, well within 32 bits.
> >
> > I _think_ the old scheme was safe in the case of overflow - again under
> > the assumption that read/written can never be further apart than the pipe
> > buffer size. However, it's much harder to reason about this case. It's
> > certainly plausible that an overflow could occur - sending 4GiB through
> > a local socket is entirely achievable.
>
> For me it looked pretty simple: you can overflow 32 bits (at 100 Gbps,
> but without hitting the "optimised" case, it would take about five
> minutes), but all the operations between the two counters are between
> two uint32_t, so they would happen in uint32_t, hence modulo 32 bits,
> similar to TCP sequences.
Plus it's only equality comparisons, so we don't need SEQ_GT or the
like. Yeah, that's the reasoning, but to me that's still a lot more
than "can't exceed pipe size".
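To put the two lines of reasoning side by side, here is a minimal
standalone sketch. It is not passt code: in_flight() and pending_update()
are names made up for illustration, and it assumes, as above, that the
counters never drift further apart than the pipe size.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>

/* Old scheme: two free-running counters, only their difference matters.
 * Unsigned subtraction is modulo 2^32, so it still gives the right
 * answer after 'rd' has wrapped around and 'wr' has not (yet).
 */
static uint32_t in_flight(uint32_t rd, uint32_t wr)
{
	return (uint32_t)(rd - wr);
}

/* New scheme: a single counter, updated the same way the patch does it.
 * Negative return values (errors) leave the counter untouched.
 */
static void pending_update(uint32_t *pending, ssize_t readlen, ssize_t written)
{
	*pending += readlen > 0 ? (uint32_t)readlen : 0;
	*pending -= written > 0 ? (uint32_t)written : 0;
}
```

Both give the same number of bytes in the pipe. The difference is only in
what the reader has to convince themselves of: modular arithmetic in the
first case, a bound by the pipe size in the second.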
> Anyway, overall, I think it's an improvement over the original. One nit
> here:
>
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_conn.h | 6 ++----
> > tcp_splice.c | 18 +++++++++---------
> > 2 files changed, 11 insertions(+), 13 deletions(-)
> >
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 9f5bee03..c8381aa7 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -206,8 +206,7 @@ struct tcp_tap_transfer_ext {
> > * @f: Generic flow information
> > * @s: File descriptor for sockets
> > * @pipe: File descriptors for pipes
> > - * @read: Bytes read (not fully written to other side in one shot)
> > - * @written: Bytes written (not fully written from one other side read)
> > + * @pending: Bytes currently in each pipe
> > * @events: Events observed/actions performed on connection
> > * @flags: Connection flags (attributes, not events)
> > */
> > @@ -218,8 +217,7 @@ struct tcp_splice_conn {
> > int s[SIDES];
> > int pipe[SIDES][2];
> >
> > - uint32_t read[SIDES];
> > - uint32_t written[SIDES];
> > + uint32_t pending[SIDES];
> >
> > uint8_t events;
> > #define SPLICE_CLOSED 0
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 18e8b303..8fbd490f 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -292,7 +292,7 @@ bool tcp_splice_flow_defer(struct tcp_splice_conn *conn)
> > conn->s[sidei] = -1;
> > }
> >
> > - conn->read[sidei] = conn->written[sidei] = 0;
> > + conn->pending[sidei] = 0;
> > }
> >
> > conn->events = SPLICE_CLOSED;
> > @@ -490,7 +490,7 @@ static int tcp_splice_forward(struct ctx *c, struct
> > int eof = 0;
> >
> > while (1) {
> > - ssize_t readlen, written, pending;
> > + ssize_t readlen, written;
> > int more = 0;
> >
> > retry:
> > @@ -537,7 +537,7 @@ retry:
> > flow_trace(conn, "%zi from write-side call (passed %zi)",
> > written, c->tcp.pipe_size);
> >
> > - /* Most common case: skip updating counters. */
> > + /* Most common case: skip updating pending. */
>
> "pending" isn't a noun (even though the variable name is, but it's
> not quite obvious that you're referring to it). I think that:
>
> /* Most common case: skip updating count of pending bytes */
>
> would be slightly clearer (and also omit the '.' because it's not a
> complete sentence, as we usually do on single-line comments, similarly
> to most occurrences in the kernel).
Done.
>
> > if (readlen > 0 && readlen == written) {
> > if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
> > continue;
> > @@ -561,11 +561,11 @@ retry:
> > continue;
> > }
> >
> > - conn->read[fromsidei] += readlen > 0 ? readlen : 0;
> > - conn->written[fromsidei] += written > 0 ? written : 0;
> > + conn->pending[fromsidei] += readlen > 0 ? readlen : 0;
> > + conn->pending[fromsidei] -= written > 0 ? written : 0;
> >
> > if (written < 0) {
> > - if (conn->read[fromsidei] == conn->written[fromsidei])
> > + if (!conn->pending[fromsidei])
> > break;
> >
> > conn_event(conn, OUT_WAIT(!fromsidei));
> > @@ -575,15 +575,15 @@ retry:
> > if (never_read && written == (long)(c->tcp.pipe_size))
> > goto retry;
> >
> > - pending = conn->read[fromsidei] - conn->written[fromsidei];
> > - if (!never_read && written > 0 && written < pending)
> > + if (!never_read && written > 0 &&
> > + written < conn->pending[fromsidei])
> > goto retry;
> >
> > if (eof)
> > break;
> > }
> >
> > - if (conn->read[fromsidei] == conn->written[fromsidei] && eof) {
> > + if (!conn->pending[fromsidei] && eof) {
> > unsigned sidei;
> >
> > flow_foreach_sidei(sidei) {
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-20 20:30 ` Stefano Brivio
@ 2026-05-21 2:03 ` David Gibson
2026-05-21 5:40 ` Stefano Brivio
0 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-21 2:03 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger
On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> On Wed, 20 May 2026 23:08:50 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > There are two ways we can tell one of our sockets has received a FIN. We
> > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > some other close out logic is based on seeing an EOF.
> >
> > Simplify this by setting the flag based on only the EOF. To make sure we
> > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_splice.c | 14 +++++---------
> > 1 file changed, 5 insertions(+), 9 deletions(-)
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 8fbd490f..b45f0060 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > int never_read = 1;
> > - int eof = 0;
> >
> > while (1) {
> > ssize_t readlen, written;
> > @@ -510,7 +509,7 @@ retry:
> > flow_trace(conn, "%zi from read-side call", readlen);
> >
> > if (!readlen) {
> > - eof = 1;
> > + conn_event(conn, FIN_RCVD(fromsidei));
>
> I'm not sure if I really found a concrete issue with this, but it looks
> a bit scary, because it changes the semantics of FIN_RCVD, which used to
> mean that we infer we received a FIN, regardless of whether we're done
> processing all data from that half of the connection.
>
> Now FIN_RCVD is only set if we actually processed all the data and we
> hit the end of file.
True. But the only place that tested FIN_RCVD was at the end of
tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
was the cause of bug 202 - we had FIN_RCVD set, but we didn't process
it and shutdown() on the other side, because we didn't have eof.
> The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> returns -1 with EAGAIN in errno because we had no room in the pipe,
> and it would have returned 0 instead.
>
> Will we ever get our zero-sized "read" later? If not, we might have
> missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> guarantees in that sense from splice().
It's not really about guarantees from splice. I'm pretty sure this is
ok, reasoning as follows.
Consider all the exit points from the loop body:
- Each return is a return -1, so we kill the connection anyway. They
don't matter
- Each continue, goto retry and the end of the body will do the read
side splice() again, so get another chance to see the EOF
- That leaves just the breaks
Consider each break (there are three, since patch 2 of this series)
if (written < 0) {
if (!conn->pending[fromsidei])
break;
(1) The pipe is empty and the write-splice returned EAGAIN, so it
didn't remove data from the pipe. Therefore, the pipe must have been
empty before the write-splice. Which means the read-splice can't have
blocked on a full pipe.
conn_event(conn, OUT_WAIT(!fromsidei));
break;
}
(2) The pipe is non-empty and the write-splice returned EAGAIN, so it
must have blocked on the output socket. We've set OUT_WAIT(), so
we'll get an EPOLLOUT at some point which will cause us to read-splice
again, meaning we get another chance to see the EOF.
[...]
if (conn->events & FIN_RCVD(fromsidei))
break;
(3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
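The three cases can be condensed into a small decision function. This is
only a model of the argument, under its own assumption that pending == 0
really means the pipe is empty. It ignores the elided retry paths, and the
enum and function names are hypothetical, not taken from tcp_splice.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum loop_exit {
	LOOP_AGAIN,		/* read-side splice() runs again */
	BREAK_PIPE_EMPTY,	/* case (1): read can't have hit a full pipe */
	BREAK_OUT_WAIT,		/* case (2): EPOLLOUT will bring us back */
	BREAK_EOF_SEEN,		/* case (3): FIN_RCVD already set from EOF */
};

/* Classify one pass through the loop body.
 * @write_eagain:	write-side splice() returned -1 with EAGAIN
 * @pending:		bytes we believe are still in the pipe
 * @fin_rcvd:		FIN_RCVD(fromsidei) is set, i.e. we saw a zero read
 */
static enum loop_exit classify_exit(bool write_eagain, uint32_t pending,
				    bool fin_rcvd)
{
	if (write_eagain)
		return pending ? BREAK_OUT_WAIT : BREAK_PIPE_EMPTY;

	if (fin_rcvd)
		return BREAK_EOF_SEEN;

	return LOOP_AGAIN;
}
```

In every outcome except BREAK_PIPE_EMPTY there is either a further
read-side splice() or an EOF already recorded, so BREAK_PIPE_EMPTY is the
one case where the argument depends on the read not having blocked.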
> The existing implementation distinguishes between end-of-file we hit in
> a given iteration, and EPOLLRDHUP we might have seen at any time.
> That was actually intended.
It might be intended, but I can't see that we did anything with that
information.
That said the conditions on which we exit / retry this loop are pretty
darn confusing. I'll see if I can improve them.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling
2026-05-20 20:30 ` Stefano Brivio
@ 2026-05-21 2:11 ` David Gibson
2026-05-21 5:40 ` Stefano Brivio
0 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-21 2:11 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger
On Wed, May 20, 2026 at 10:30:23PM +0200, Stefano Brivio wrote:
> On Wed, 20 May 2026 23:08:51 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > At the end of tcp_splice_forward(), we check for half-closed connections
> > and propagate the FIN to the other side with a shutdown(2). Currently we
> > check for a half closed connection in either direction. That's unnecessary
> > here, because tcp_splice_forward() will already be called for each
> > direction if there are any relevant events.
>
> True, but do we have the guarantee that tcp_splice_forward() will also
> be called once all relevant FIN_RCVD / FIN_SENT flags have been sent?
Yes, because tcp_splice_forward() is (now) the only place that *sets*
FIN_RCVD (or FIN_SENT).
> The reason why we check both sides here is that we might have updated
> flags for one side, and now we need to double check if it's time to
> call shutdown() as a consequence.
>
> Maybe we never have to, but I think it's not really obvious to prove.
tcp_splice_forward() only touches FIN_RCVD(fromsidei). So, we only
need to examine FIN_RCVD(fromsidei). If FIN_RCVD is set for the other
side, it must be in another call to tcp_splice_forward() which will
also examine that other flag and shutdown() as necessary.
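As a sketch of that per-direction check, mirroring the condition in the
patch below: the bit layout here is invented for the example (the real
FIN_RCVD() / FIN_SENT() macros are defined in tcp_conn.h and may well
differ), and need_shutdown() is a hypothetical helper, not a function in
the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout only, not the one from tcp_conn.h */
#define EX_FIN_RCVD(sidei)	(1u << (0 + (sidei)))
#define EX_FIN_SENT(sidei)	(1u << (2 + (sidei)))

/* True if the FIN seen on side 'fromsidei' should be propagated now:
 * EOF was seen on that side, the pipe has drained, and we haven't
 * already called shutdown() on the opposite socket.
 */
static bool need_shutdown(uint8_t events, uint32_t pending, unsigned fromsidei)
{
	return (events & EX_FIN_RCVD(fromsidei)) &&
	       !(events & EX_FIN_SENT(!fromsidei)) &&
	       !pending;
}
```

Note that a call for side 0 deliberately ignores FIN_RCVD for side 1:
that flag is handled by the tcp_splice_forward() call for the other
direction, which is the one that set it.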
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_splice.c | 22 ++++++++--------------
> > 1 file changed, 8 insertions(+), 14 deletions(-)
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index b45f0060..e5018f2e 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -582,21 +582,15 @@ retry:
> > break;
> > }
> >
> > - if (!conn->pending[fromsidei] &&
> > - conn->events & FIN_RCVD(fromsidei)) {
> > - unsigned sidei;
> > -
> > - flow_foreach_sidei(sidei) {
> > - if ((conn->events & FIN_RCVD(sidei)) &&
> > - !(conn->events & FIN_SENT(!sidei))) {
> > - if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> > - flow_perror(conn, "shutdown() on %s",
> > - pif_name(conn->f.pif[!sidei]));
> > - return -1;
> > - }
> > - conn_event(conn, FIN_SENT(!sidei));
> > - }
> > + if ((conn->events & FIN_RCVD(fromsidei)) &&
> > + !(conn->events & FIN_SENT(!fromsidei)) &&
> > + !conn->pending[fromsidei]) {
> > + if (shutdown(conn->s[!fromsidei], SHUT_WR) < 0) {
> > + flow_perror(conn, "shutdown() on %s",
> > + pif_name(conn->f.pif[!fromsidei]));
> > + return -1;
> > }
> > + conn_event(conn, FIN_SENT(!fromsidei));
> > }
> >
> > return 0;
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 1/6] tcp_splice: Improve error reporting
2026-05-21 0:43 ` David Gibson
@ 2026-05-21 5:08 ` Stefano Brivio
0 siblings, 0 replies; 27+ messages in thread
From: Stefano Brivio @ 2026-05-21 5:08 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger, Anshu Kumari
On Thu, 21 May 2026 10:43:37 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Wed, May 20, 2026 at 04:31:34PM +0200, Stefano Brivio wrote:
> > On Wed, 20 May 2026 23:08:46 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > A number of things can, at least theoretically, go wrong when forwarding
> > > data across a spliced connection. We generally handle this by resetting
> > > the connection on both sides. However, in many cases we don't log any
> > > message about why the connection was reset, which can make it hard to
> > > debug why this is happening.
> > >
> > > Add a bunch of debug and error logging to make this easier to figure out.
> > >
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > > tcp_splice.c | 31 +++++++++++++++++++++++--------
> > > 1 file changed, 23 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > index 42ee8abc..1359d6b8 100644
> > > --- a/tcp_splice.c
> > > +++ b/tcp_splice.c
> > > @@ -502,15 +502,18 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> > > if (rc)
> > > flow_perror(conn, "Error retrieving SO_ERROR");
> > > else
> > > - flow_trace(conn, "Error event on socket: %s",
> > > - strerror_(err));
> > > -
> > > + flow_dbg(conn, "Error event on %s socket: %s",
> > > + pif_name(conn->f.pif[evsidei]),
> > > + strerror_(err));
> > > goto reset;
> > > }
> > >
> > > if (conn->events == SPLICE_CONNECT) {
> > > - if (!(events & EPOLLOUT))
> > > + if (!(events & EPOLLOUT)) {
> > > + flow_err(conn, "Unexpected events 0x%x during connect",
> > > + events);
> >
> > Shouldn't all the flow_err() and flow_perror() calls here be
> > ratelimited, that is, eventually calling the err_ratelimit() function
> > Anshu introduced recently?
>
> I did think about that, I concluded it wasn't necessary here because
> it indicates that something has gone unexpectedly wrong at the kernel
> level, it's not guest triggerable.
>
> I can put in ratelimits if you still think they're necessary.
I do, yes, because we're dangerously close to something the container
can indirectly try to trigger, I think.
> > We don't have helpers ready for flow_err() and flow_perror(), I was
> > about to post a patch that would go before this series but I'm not sure
> > if there's a specific reason to avoid those.
> >
> > > goto reset;
> > > + }
> > > if (tcp_splice_connect_finish(c, conn))
> > > goto reset;
> > > }
> > > @@ -545,8 +548,11 @@ retry:
> > > SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
> > > while (readlen < 0 && errno == EINTR);
> > >
> > > - if (readlen < 0 && errno != EAGAIN)
> > > + if (readlen < 0 && errno != EAGAIN) {
> > > + flow_perror(conn, "Splicing from %s socket",
> > > + pif_name(conn->f.pif[fromsidei]));
> > > goto reset;
> > > + }
> > >
> > > flow_trace(conn, "%zi from read-side call", readlen);
> > >
> > > @@ -569,8 +575,11 @@ retry:
> > > SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
> > > while (written < 0 && errno == EINTR);
> > >
> > > - if (written < 0 && errno != EAGAIN)
> > > + if (written < 0 && errno != EAGAIN) {
> > > + flow_perror(conn, "Splicing to %s socket",
> > > + pif_name(conn->f.pif[!fromsidei]));
> > > goto reset;
> > > + }
> > >
> > > flow_trace(conn, "%zi from write-side call (passed %zi)",
> > > written, c->tcp.pipe_size);
> > > @@ -627,8 +636,11 @@ retry:
> > > flow_foreach_sidei(sidei) {
> > > if ((conn->events & FIN_RCVD(sidei)) &&
> > > !(conn->events & FIN_SENT(!sidei))) {
> > > - if (shutdown(conn->s[!sidei], SHUT_WR) < 0)
> > > + if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> > > + flow_perror(conn, "shutdown() on %s",
> > > + pif_name(conn->f.pif[!sidei]));
> > > goto reset;
> > > + }
> > > conn_event(conn, FIN_SENT(!sidei));
> > > }
> > > }
> > > @@ -647,8 +659,11 @@ retry:
> > > goto swap;
> > > }
> > >
> > > - if (events & EPOLLHUP)
> > > + if (events & EPOLLHUP) {
> > > + flow_dbg(conn, "Hangup from %s socket",
> > > + pif_name(conn->f.pif[evsidei]));
>
> Except for this one, which is debug level for that reason.
>
> > > goto reset;
> > > + }
> > >
> > > return;
> > >
--
Stefano
* Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-21 2:03 ` David Gibson
@ 2026-05-21 5:40 ` Stefano Brivio
2026-05-21 6:56 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-21 5:40 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
On Thu, 21 May 2026 12:03:33 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > On Wed, 20 May 2026 23:08:50 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > There are two ways we can tell one of our sockets has received a FIN. We
> > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > some other close out logic is based on seeing an EOF.
> > >
> > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > >
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > > tcp_splice.c | 14 +++++---------
> > > 1 file changed, 5 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > index 8fbd490f..b45f0060 100644
> > > --- a/tcp_splice.c
> > > +++ b/tcp_splice.c
> > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > int never_read = 1;
> > > - int eof = 0;
> > >
> > > while (1) {
> > > ssize_t readlen, written;
> > > @@ -510,7 +509,7 @@ retry:
> > > flow_trace(conn, "%zi from read-side call", readlen);
> > >
> > > if (!readlen) {
> > > - eof = 1;
> > > + conn_event(conn, FIN_RCVD(fromsidei));
> >
> > I'm not sure if I really found a concrete issue with this, but it looks
> > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > mean that we infer we received a FIN, regardless of whether we're done
> > processing all data from that half of the connection.
> >
> > Now FIN_RCVD is only set if we actually processed all the data and we
> > hit the end of file.
>
> True. But the only place that tested FIN_RCVD was at the end of
> tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
> was the cause of bug 202 - we had FIN_RCVD set, but we didn't process
> it and shutdown() on the other side, because we didn't have eof.
That sounds like a good motivation to clean this up, just two concerns
below:
> > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > and it would have returned 0 instead.
> >
> > Will we ever get our zero-sized "read" later? If not, we might have
> > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > guarantees in that sense from splice().
>
> It's not really about guarantees from splice. I'm pretty sure this is
> ok, reasoning as follows.
>
> Consider all the exit points from the loop body:
> - Each return is a return -1, so we kill the connection anyway. They
> don't matter
> - Each continue, goto retry and the end of the body will do the read
> side splice() again, so get another chance to see the EOF
> - That leaves just the breaks
>
> Consider each break (there are three, since patch 2 of this series)
> if (written < 0) {
> if (!conn->pending[fromsidei])
> break;
>
> (1) The pipe is empty and the write-splice returned EAGAIN, so it
> didn't remove data from the pipe.
You're assuming that !conn->pending[fromsidei] means that the pipe is
empty. From what we see of it, it is.
What the kernel can do with it, though, is different. It might return
EAGAIN even if we think we should have space, because it's resizing it
under memory pressure or anything like that. Or it delays freeing up
space or accounting for whatever reason.
So it would be nice to make this part robust to that. I thought setting
FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
> Therefore, the pipe must have been
> empty before the write-splice. Which means the read-splice can't have
> blocked on a full pipe.
> conn_event(conn, OUT_WAIT(!fromsidei));
> break;
> }
>
> (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> must have blocked on the output socket. We've set OUT_WAIT(), so
> we'll get an EPOLLOUT at some point which will cause us to read-splice
> again, meaning we get another chance to see the EOF.
...later. But what if we don't get a zero-sized read *at all*? I'm not
sure if splice() guarantees we do get one if we reach end-of-file.
That's something valid and very well established for read() and recv(),
but splice() is a bit weird. The documentation says:
A return value of 0 means end of input.
but I wouldn't assume we'll *always* get at least one in case of EOF.
>
> [...]
> if (conn->events & FIN_RCVD(fromsidei))
> break;
> (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
>
> > The existing implementation distinguishes between end-of-file we hit in
> > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > That was actually intended.
>
> It might be intended, but I can't see that we did anything with that
> information.
We always set FIN_RCVD on it. You're right, if we only checked that on
'eof', that didn't solve much, but that wasn't necessarily intended. My
original intention was to make setting of FIN_RCVD (or whatever it was
originally) robust.
> That said the conditions on which we exit / retry this loop are pretty
> darn confusing. I'll see if I can improve them.
--
Stefano
* Re: [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling
2026-05-21 2:11 ` David Gibson
@ 2026-05-21 5:40 ` Stefano Brivio
0 siblings, 0 replies; 27+ messages in thread
From: Stefano Brivio @ 2026-05-21 5:40 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
On Thu, 21 May 2026 12:11:49 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Wed, May 20, 2026 at 10:30:23PM +0200, Stefano Brivio wrote:
> > On Wed, 20 May 2026 23:08:51 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > At the end of tcp_splice_forward(), we check for half-closed connections
> > > and propagate the FIN to the other side with a shutdown(2). Currently we
> > > check for a half closed connection in either direction. That's unnecessary
> > > here, because tcp_splice_forward() will already be called for each
> > > direction if there are any relevant events.
> >
> > True, but do we have the guarantee that tcp_splice_forward() will also
> > be called once all relevant FIN_RCVD / FIN_SENT flags have been sent?
>
> Yes, because tcp_splice_forward() is (now) the only place that *sets*
> FIN_RCVD (or FIN_SENT).
>
> > The reason why we check both sides here is that we might have updated
> > flags for one side, and now we need to double check if it's time to
> > call shutdown() as a consequence.
> >
> > Maybe we never have to, but I think it's not really obvious to prove.
>
> tcp_splice_forward() only touches FIN_RCVD(fromsidei). So, we only
> need to examine FIN_RCVD(fromsidei). If FIN_RCVD is set for the other
> side, it must be in another call to tcp_splice_forward() which will
> also examine that other flag and shutdown() as necessary.
Okay, fair. That's what I call "not really obvious" but it's probably
obvious enough.
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > > tcp_splice.c | 22 ++++++++--------------
> > > 1 file changed, 8 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > index b45f0060..e5018f2e 100644
> > > --- a/tcp_splice.c
> > > +++ b/tcp_splice.c
> > > @@ -582,21 +582,15 @@ retry:
> > > break;
> > > }
> > >
> > > - if (!conn->pending[fromsidei] &&
> > > - conn->events & FIN_RCVD(fromsidei)) {
> > > - unsigned sidei;
> > > -
> > > - flow_foreach_sidei(sidei) {
> > > - if ((conn->events & FIN_RCVD(sidei)) &&
> > > - !(conn->events & FIN_SENT(!sidei))) {
> > > - if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> > > - flow_perror(conn, "shutdown() on %s",
> > > - pif_name(conn->f.pif[!sidei]));
> > > - return -1;
> > > - }
> > > - conn_event(conn, FIN_SENT(!sidei));
> > > - }
> > > + if ((conn->events & FIN_RCVD(fromsidei)) &&
> > > + !(conn->events & FIN_SENT(!fromsidei)) &&
> > > + !conn->pending[fromsidei]) {
> > > + if (shutdown(conn->s[!fromsidei], SHUT_WR) < 0) {
> > > + flow_perror(conn, "shutdown() on %s",
> > > + pif_name(conn->f.pif[!fromsidei]));
> > > + return -1;
> > > }
> > > + conn_event(conn, FIN_SENT(!fromsidei));
> > > }
> > >
> > > return 0;
--
Stefano
* Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-21 5:40 ` Stefano Brivio
@ 2026-05-21 6:56 ` David Gibson
2026-05-21 7:15 ` Stefano Brivio
0 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-21 6:56 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger
On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote:
> On Thu, 21 May 2026 12:03:33 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > > On Wed, 20 May 2026 23:08:50 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > There are two ways we can tell one of our sockets has received a FIN. We
> > > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > > some other close out logic is based on seeing an EOF.
> > > >
> > > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > >
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > > tcp_splice.c | 14 +++++---------
> > > > 1 file changed, 5 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > index 8fbd490f..b45f0060 100644
> > > > --- a/tcp_splice.c
> > > > +++ b/tcp_splice.c
> > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > > > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > int never_read = 1;
> > > > - int eof = 0;
> > > >
> > > > while (1) {
> > > > ssize_t readlen, written;
> > > > @@ -510,7 +509,7 @@ retry:
> > > > flow_trace(conn, "%zi from read-side call", readlen);
> > > >
> > > > if (!readlen) {
> > > > - eof = 1;
> > > > + conn_event(conn, FIN_RCVD(fromsidei));
> > >
> > > I'm not sure if I really found a concrete issue with this, but it looks
> > > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > > mean that we infer we received a FIN, regardless of whether we're done
> > > processing all data from that half of the connection.
> > >
> > > Now FIN_RCVD is only set if we actually processed all the data and we
> > > hit the end of file.
> >
> > True. But the only place that tested FIN_RCVD was at the end of
> > tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
> > was the cause of bug 202 - we had FIN_RCVD set, but we didn't process
> > it and shutdown() on the other side, because we didn't have eof.
>
> That sounds like a good motivation to clean this up, just two concerns
> below:
>
> > > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > > and it would have returned 0 instead.
> > >
> > > Will we ever get our zero-sized "read" later? If not, we might have
> > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > > guarantees in that sense from splice().
> >
> > It's not really about guarantees from splice. I'm pretty sure this is
> > ok, reasoning as follows.
> >
> > Consider all the exit points from the loop body:
> > - Each return is a return -1, so we kill the connection anyway. They
> > don't matter
> > - Each continue, goto retry and the end of the body will do the read
> > side splice() again, so get another chance to see the EOF
> > - That leaves just the breaks
> >
> > Consider each break (there are three, since patch 2 of this series)
> > if (written < 0) {
> > if (!conn->pending[fromsidei])
> > break;
> >
> > (1) The pipe is empty and the write-splice returned EAGAIN, so it
> > didn't remove data from the pipe.
>
> You're assuming that !conn->pending[fromsidei] means that the pipe is
> empty. From what we see of it, it is.
It does mean the pipe is empty. Everything we put in, we've taken
out. There cannot be anything in there.
> What the kernel can do with it, though, is different. It might return
> EAGAIN even if we think we should have space, because it's resizing it
> under memory pressure or anything like that. Or it delays freeing up
> space or accounting for whatever reason.
Theoretically, I suppose. But !pending doesn't just mean the pipe is
not full, it means it's completely empty. Not being able to
put any bytes at all into an empty pipe would be *very* surprising.
So much so that if it happened in practice, I suspect we wouldn't be
safe not having epoll events on the pipe ends, so that we can be
notified when it deigns to accept some data.
> So it would be nice to make this part robust to that. I thought setting
> FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
>
> > Therefore, the pipe must have been
> > empty before the write-splice. Which means the read-splice can't have
> > blocked on a full pipe.
> > conn_event(conn, OUT_WAIT(!fromsidei));
> > break;
> > }
> >
> > (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> > must have blocked on the output socket. We've set OUT_WAIT(), so
> > we'll get an EPOLLOUT at some point which will cause us to read-splice
> > again, meaning we get another chance to see the EOF.
>
> ...later. But what if we don't get a zero-sized read *at all*? I'm not
> sure if splice() guarantees we do get one if we reach end-of-file.
> That's something valid and very well established for read() and recv(),
> but splice() is a bit weird. The documentation says:
>
> A return value of 0 means end of input.
>
> but I wouldn't assume we'll *always* get at least one in case of EOF.
What else could we plausibly get?
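One way to probe this empirically, as opposed to proving it, is a small
standalone test along these lines. It is a sketch, not passt code:
probe_splice_eof() is a made-up name, and for brevity it uses an AF_UNIX
socketpair rather than TCP sockets, so it goes through a different kernel
splice path than tcp_splice.c does. It says nothing about behaviour under
memory pressure or with a full pipe.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for splice() and SPLICE_F_NONBLOCK */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Queue 3 bytes, half-close the writing socket, then splice() twice from
 * the reading socket into an empty pipe. The two return values are stored
 * in @first and @second. Returns 0 if the setup worked, -1 otherwise.
 */
static int probe_splice_eof(ssize_t *first, ssize_t *second)
{
	int sv[2], p[2], ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;
	if (pipe(p))
		goto out_sock;

	if (write(sv[1], "abc", 3) != 3 || shutdown(sv[1], SHUT_WR))
		goto out;

	/* First call should move the queued data, second should hit EOF */
	*first = splice(sv[0], NULL, p[1], NULL, 65536, SPLICE_F_NONBLOCK);
	*second = splice(sv[0], NULL, p[1], NULL, 65536, SPLICE_F_NONBLOCK);
	ret = 0;
out:
	close(p[0]);
	close(p[1]);
out_sock:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

The expectation on Linux would be 3 from the first call and 0 from the
second, that is, data first and then the zero-sized "read". A result of -1
with EAGAIN from the second call would be the surprising outcome discussed
above.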
> > [...]
> > if (conn->events & FIN_RCVD(fromsidei))
> > break;
> > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> >
> > > The existing implementation distinguishes between end-of-file we hit in
> > > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > > That was actually intended.
> >
> > It might be intended, but I can't see that we did anything with that
> > information.
>
> We always set FIN_RCVD on it. You're right, if we only checked that on
> 'eof', that didn't solve much, but that wasn't necessarily intended. My
> original intention was to make setting of FIN_RCVD (or whatever it was
> originally) robust.
Ok, well. I've spotted other changes to make in the vicinity that I
think will make some of this easier to reason about anyway. So I'll
consider your points as I rework this and other patches.
> > That said the conditions on which we exit / retry this loop are pretty
> > darn confusing. I'll see if I can improve them.
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-21 6:56 ` David Gibson
@ 2026-05-21 7:15 ` Stefano Brivio
2026-05-21 13:51 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-21 7:15 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
On Thu, 21 May 2026 16:56:43 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote:
> > On Thu, 21 May 2026 12:03:33 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > > > On Wed, 20 May 2026 23:08:50 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > There are two ways we can tell one of our sockets has received a FIN. We
> > > > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > > > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > > > some other close out logic is based on seeing an EOF.
> > > > >
> > > > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > > >
> > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > ---
> > > > > tcp_splice.c | 14 +++++---------
> > > > > 1 file changed, 5 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > > index 8fbd490f..b45f0060 100644
> > > > > --- a/tcp_splice.c
> > > > > +++ b/tcp_splice.c
> > > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > > > > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > > int never_read = 1;
> > > > > - int eof = 0;
> > > > >
> > > > > while (1) {
> > > > > ssize_t readlen, written;
> > > > > @@ -510,7 +509,7 @@ retry:
> > > > > flow_trace(conn, "%zi from read-side call", readlen);
> > > > >
> > > > > if (!readlen) {
> > > > > - eof = 1;
> > > > > + conn_event(conn, FIN_RCVD(fromsidei));
> > > >
> > > > I'm not sure if I really found a concrete issue with this, but it looks
> > > > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > > > mean that we infer we received a FIN, regardless of whether we're done
> > > > processing all data from that half of the connection.
> > > >
> > > > Now FIN_RCVD is only set if we actually processed all the data and we
> > > > hit the end of file.
> > >
> > > True. But the only place that tested FIN_RCVD was at the end of
> > > tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
> > > was the cause of bug202 - we had FIN_RCVD set, but we didn't process
> > > it and shutdown() on the other side, because we didn't have eof.
> >
> > That sounds like a good motivation to clean this up, just two concerns
> > below:
> >
> > > > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > > > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > > > and it would have returned 0 instead.
> > > >
> > > > Will we ever get our zero-sized "read" later? If not, we might have
> > > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > > > guarantees in that sense from splice().
> > >
> > > It's not really about guarantees from splice. I'm pretty sure this is
> > > ok, reasoning as follows.
> > >
> > > Consider all the exit points from the loop body:
> > > - Each return is a return -1, so we kill the connection anyway. They
> > > don't matter
> > > - Each continue, goto retry and the end of the body will do the read
> > > side splice() again, so get another chance to see the EOF
> > > - That leaves just the breaks
> > >
> > > Consider each break (there are three, since patch 2 of this series)
> > > if (written < 0) {
> > > if (!conn->pending[fromsidei])
> > > break;
> > >
> > > (1) The pipe is empty and the write-splice returned EAGAIN, so it
> > > didn't remove data from the pipe.
> >
> > You're assuming that !conn->pending[fromsidei] means that the pipe is
> > empty. From what we see of it, it is.
>
> It does mean the pipe is empty. Everything we put in, we've taken
> out. There cannot be anything in there.
>
> > What the kernel can do with it, though, is different. It might return
> > EAGAIN even if we think we should have space, because it's resizing it
> > under memory pressure or anything like that. Or it delays freeing up
> > space or accounting for whatever reason.
>
> Theoretically, I suppose. But !pending doesn't just mean the pipe is
> not full; it means it's completely empty. Not being able to
> put any bytes at all into an empty pipe would be *very* surprising.
> So much so that if it happened in practice, I suspect we wouldn't be
> safe not having epoll events on the pipe ends, so that we can be
> notified when it deigns to accept some data.
We can get 512-byte pipes. I actually saw that happening in practice,
with either:
- people setting low values for ulimits
- the user (or just pasta itself) having a lot of pipes open
and if I recall correctly that's where I saw the case of a supposedly
empty pipe giving us EAGAIN. That was years ago though and I didn't
specifically fix that.
We currently probe the size based on the value we can have for 32 pipes
(TCP_SPLICE_PIPE_POOL_SIZE). By making that 4096 or so you should get
rather small pipes.
Things might already be broken with them, I haven't checked the
behaviour in a long while. I think 512 bytes was the lower bound I hit.
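
To reproduce small pipes for testing, something along these lines might
help. It is only a minimal sketch, assuming Linux and the F_SETPIPE_SZ /
F_GETPIPE_SZ commands from fcntl(2). set_pipe_size() is a hypothetical
helper, not a function from tcp_splice.c, and the kernel may round the
requested size up, so the sketch reports the capacity it actually got:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not from tcp_splice.c): ask the kernel to resize
 * the pipe behind @fd to @want bytes and return the capacity it reports
 * afterwards, or -1 on error. The kernel may round the request up, so
 * the result can be larger than @want.
 */
static int set_pipe_size(int fd, int want)
{
	assert(want > 0);

	if (fcntl(fd, F_SETPIPE_SZ, want) < 0)
		return -1;

	return fcntl(fd, F_GETPIPE_SZ);
}
```

Calling set_pipe_size(pipefd[1], 4096) on a freshly created pipe and
logging the result shows how small the pipe really got on a given system.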
> > So it would be nice to make this part robust to that. I thought setting
> > FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
> >
> > > Therefore, the pipe must have been
> > > empty before the write-splice. Which means the read-splice can't have
> > > blocked on a full pipe.
> > > conn_event(conn, OUT_WAIT(!fromsidei));
> > > break;
> > > }
> > >
> > > (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> > > must have blocked on the output socket. We've set OUT_WAIT(), so
> > > we'll get an EPOLLOUT at some point which will cause us to read-splice
> > > again, meaning we get another chance to see the EOF.
> >
> > ...later. But what if we don't get a zero-sized read *at all*? I'm not
> > sure if splice() guarantees we do get one if we reach end-of-file.
>
> > That's something valid and very well established for read() and recv(),
> > but splice() is a bit weird. The documentation says:
> >
> > A return value of 0 means end of input.
> >
> > but I wouldn't assume we'll *always* get at least one in case of EOF.
>
> What else could we plausibly get?
-1 with EBADF, probably with EPOLLERR, because something timed out?
But I guess you're right, as long as we're not in the EPOLLERR category
of things, we should consistently get 0, even if we read multiple
times.
I had in mind some kernel behaviour where we get 0 once, and then -1
(EAGAIN?) because... go figure. But no, it can't happen.
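
The "consistently get 0" behaviour can be checked with a small sketch.
It assumes an AF_UNIX stream socketpair instead of TCP sockets, purely
to stay self-contained, so it says nothing definitive about TCP.
count_eof_reads() is a hypothetical helper, not part of tcp_splice.c:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: call splice() @tries times from socket @sock into
 * the pipe write end @pipe_wr, and return how many of those calls
 * returned 0 (end of input).
 */
static int count_eof_reads(int sock, int pipe_wr, int tries)
{
	int i, zeros = 0;

	assert(tries > 0);

	for (i = 0; i < tries; i++) {
		ssize_t n = splice(sock, NULL, pipe_wr, NULL, 4096,
				   SPLICE_F_NONBLOCK);

		if (n == 0)
			zeros++;
	}

	return zeros;
}
```

After the peer has called shutdown(SHUT_WR) and all data is drained, the
count should equal the number of tries if repeated reads at EOF keep
returning 0.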
> > > [...]
> > > if (conn->events & FIN_RCVD(fromsidei))
> > > break;
> > > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> > >
> > > > The existing implementation distinguishes between end-of-file we hit in
> > > > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > > > That was actually intended.
> > >
> > > It might be intended, but I can't see that we did anything with that
> > > information.
> >
> > We always set FIN_RCVD on it. You're right, if we only checked that on
> > 'eof', that didn't solve much, but that wasn't necessarily intended. My
> > original intention was to make setting of FIN_RCVD (or whatever it was
> > originally) robust.
>
> Ok, well. I've spotted other changes to make in the vicinity that I
> think will make some of this easier to reason about anyway. So I'll
> consider your points as I rework this and other patches.
>
> > > That said the conditions on which we exit / retry this loop are pretty
> > > darn confusing. I'll see if I can improve them.
> >
> > --
> > Stefano
--
Stefano
* Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-21 7:15 ` Stefano Brivio
@ 2026-05-21 13:51 ` David Gibson
2026-05-21 15:18 ` Stefano Brivio
0 siblings, 1 reply; 27+ messages in thread
From: David Gibson @ 2026-05-21 13:51 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger
On Thu, May 21, 2026 at 09:15:13AM +0200, Stefano Brivio wrote:
> On Thu, 21 May 2026 16:56:43 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote:
> > > On Thu, 21 May 2026 12:03:33 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > > > > On Wed, 20 May 2026 23:08:50 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >
> > > > > > There are two ways we can tell one of our sockets has received a FIN. We
> > > > > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > > > > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > > > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > > > > some other close out logic is based on seeing an EOF.
> > > > > >
> > > > > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > > > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > > > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > > > >
> > > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > ---
> > > > > > tcp_splice.c | 14 +++++---------
> > > > > > 1 file changed, 5 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > > > index 8fbd490f..b45f0060 100644
> > > > > > --- a/tcp_splice.c
> > > > > > +++ b/tcp_splice.c
> > > > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > > > > > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > > > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > > > int never_read = 1;
> > > > > > - int eof = 0;
> > > > > >
> > > > > > while (1) {
> > > > > > ssize_t readlen, written;
> > > > > > @@ -510,7 +509,7 @@ retry:
> > > > > > flow_trace(conn, "%zi from read-side call", readlen);
> > > > > >
> > > > > > if (!readlen) {
> > > > > > - eof = 1;
> > > > > > + conn_event(conn, FIN_RCVD(fromsidei));
> > > > >
> > > > > I'm not sure if I really found a concrete issue with this, but it looks
> > > > > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > > > > mean that we infer we received a FIN, regardless of whether we're done
> > > > > processing all data from that half of the connection.
> > > > >
> > > > > Now FIN_RCVD is only set if we actually processed all the data and we
> > > > > hit the end of file.
> > > >
> > > > True. But the only place that tested FIN_RCVD was at the end of
> > > > tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
> > > > was the cause of bug202 - we had FIN_RCVD set, but we didn't process
> > > > it and shutdown() on the other side, because we didn't have eof.
> > >
> > > That sounds like a good motivation to clean this up, just two concerns
> > > below:
> > >
> > > > > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > > > > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > > > > and it would have returned 0 instead.
> > > > >
> > > > > Will we ever get our zero-sized "read" later? If not, we might have
> > > > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > > > > guarantees in that sense from splice().
> > > >
> > > > It's not really about guarantees from splice. I'm pretty sure this is
> > > > ok, reasoning as follows.
> > > >
> > > > Consider all the exit points from the loop body:
> > > > - Each return is a return -1, so we kill the connection anyway. They
> > > > don't matter
> > > > - Each continue, goto retry and the end of the body will do the read
> > > > side splice() again, so get another chance to see the EOF
> > > > - That leaves just the breaks
> > > >
> > > > Consider each break (there are three, since patch 2 of this series)
> > > > if (written < 0) {
> > > > if (!conn->pending[fromsidei])
> > > > break;
> > > >
> > > > (1) The pipe is empty and the write-splice returned EAGAIN, so it
> > > > didn't remove data from the pipe.
> > >
> > > You're assuming that !conn->pending[fromsidei] means that the pipe is
> > > empty. From what we see of it, it is.
> >
> > It does mean the pipe is empty. Everything we put in, we've taken
> > out. There cannot be anything in there.
> >
> > > What the kernel can do with it, though, is different. It might return
> > > EAGAIN even if we think we should have space, because it's resizing it
> > > under memory pressure or anything like that. Or it delays freeing up
> > > space or accounting for whatever reason.
> >
> > Theoretically, I suppose. But !pending doesn't just mean the pipe is
> > not full; it means it's completely empty. Not being able to
> > put any bytes at all into an empty pipe would be *very* surprising.
> > So much so that if it happened in practice, I suspect we wouldn't be
> > safe not having epoll events on the pipe ends, so that we can be
> > notified when it deigns to accept some data.
>
> We can get 512-byte pipes, I actually saw that happening in practice
> with either:
Sure.. so? We can still put some bytes into it if it's empty.
> - people setting low values for ulimits
>
> - the user (or just pasta itself) having a lot of pipes open
>
> and if I recall correctly that's where I saw the case of a supposedly
> empty pipe giving us EAGAIN. That was years ago though and I didn't
> specifically fix that.
I mean.. that sounds like a kernel bug. If we do have to handle that
case we'll need epoll events on the pipe ends, since none of the
socket events we monitor will trigger when the pipe becomes writable.
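
For illustration, watching a pipe end could look roughly like this. It
is a minimal sketch, assuming a plain epoll instance and storing the fd
in epoll_event.data.fd. The real epoll references in passt are more
structured than that, and watch_pipe_writable() is a hypothetical name:

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: register the pipe write end @pipe_wr for EPOLLOUT
 * on @epollfd, so the caller is woken when the pipe can take data.
 * Returns 0 on success, -1 on error.
 */
static int watch_pipe_writable(int epollfd, int pipe_wr)
{
	struct epoll_event ev = { .events = EPOLLOUT };

	assert(epollfd >= 0 && pipe_wr >= 0);

	ev.data.fd = pipe_wr;
	return epoll_ctl(epollfd, EPOLL_CTL_ADD, pipe_wr, &ev);
}
```

An empty pipe reports EPOLLOUT straight away, so in practice this would
only be armed while the pipe is refusing data, and removed again after.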
> We currently probe the size based on the value we can have for 32 pipes
> (TCP_SPLICE_PIPE_POOL_SIZE). By making that 4096 or so you should get
> rather small pipes.
>
> Things might already be broken with them, I haven't checked the
> behaviour in a long while. I think 512 bytes was the lower bound I hit.
>
> > > So it would be nice to make this part robust to that. I thought setting
> > > FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
> > >
> > > > Therefore, the pipe must have been
> > > > empty before the write-splice. Which means the read-splice can't have
> > > > blocked on a full pipe.
> > > > conn_event(conn, OUT_WAIT(!fromsidei));
> > > > break;
> > > > }
> > > >
> > > > (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> > > > must have blocked on the output socket. We've set OUT_WAIT(), so
> > > > we'll get an EPOLLOUT at some point which will cause us to read-splice
> > > > again, meaning we get another chance to see the EOF.
> > >
> > > ...later. But what if we don't get a zero-sized read *at all*? I'm not
> > > sure if splice() guarantees we do get one if we reach end-of-file.
> >
> > > That's something valid and very well established for read() and recv(),
> > > but splice() is a bit weird. The documentation says:
> > >
> > > A return value of 0 means end of input.
> > >
> > > but I wouldn't assume we'll *always* get at least one in case of EOF.
> >
> > What else could we plausibly get?
>
> -1 with EBADF, probably with EPOLLERR, because something timed out?
EBADF makes no sense, the fds are still valid, even if they're at EOF.
> But I guess you're right, as long as we're not in the EPOLLERR category
> of things, we should consistently get 0, even if we read multiple
> times.
>
> I had in mind some kernel behaviour where we get 0 once, and then -1
> (EAGAIN?) because... go figure. But no, it can't happen.
I think the logic should be ok as long as we see a 0 once, even if we
get EAGAINs after that.
Another way to look at this - if we had to monitor EPOLLRDHUP to get
this right, splice() would be unusable from blocking / synchronous
code, which I don't think is the case.
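
As a sketch of what such blocking code looks like: the loop below relies
only on the read-side splice() returning 0 to detect EOF, with no epoll
involved. It assumes blocking stream sockets and a private pipe.
forward_until_eof() is a hypothetical helper, much simpler than
tcp_splice_forward():

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: forward everything from blocking socket @from to
 * blocking socket @to through a private pipe, stopping only when the
 * read-side splice() returns 0. Returns bytes forwarded, or -1 on error.
 */
static ssize_t forward_until_eof(int from, int to)
{
	ssize_t total = 0;
	int p[2];

	assert(from >= 0 && to >= 0);

	if (pipe(p) < 0)
		return -1;

	while (1) {
		ssize_t n = splice(from, NULL, p[1], NULL, 4096, 0);

		if (n < 0) {
			total = -1;
			break;
		}
		if (n == 0) {		/* EOF: propagate the FIN */
			shutdown(to, SHUT_WR);
			break;
		}

		/* Drain what we just queued before reading more */
		while (n > 0) {
			ssize_t w = splice(p[0], NULL, to, NULL, n, 0);

			if (w <= 0) {
				total = -1;
				goto out;
			}
			n -= w;
			total += w;
		}
	}
out:
	close(p[0]);
	close(p[1]);
	return total;
}
```

Because the pipe is drained before each read-side splice(), the read
side never blocks on a full pipe here, which is the same property the
break (1) argument depends on.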
>
> > > > [...]
> > > > if (conn->events & FIN_RCVD(fromsidei))
> > > > break;
> > > > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> > > >
> > > > > The existing implementation distinguishes between end-of-file we hit in
> > > > > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > > > > That was actually intended.
> > > >
> > > > It might be intended, but I can't see that we did anything with that
> > > > information.
> > >
> > > We always set FIN_RCVD on it. You're right, if we only checked that on
> > > 'eof', that didn't solve much, but that wasn't necessarily intended. My
> > > original intention was to make setting of FIN_RCVD (or whatever it was
> > > originally) robust.
> >
> > Ok, well. I've spotted other changes to make in the vicinity that I
> > think will make some of this easier to reason about anyway. So I'll
> > consider your points as I rework this and other patches.
> >
> > > > That said the conditions on which we exit / retry this loop are pretty
> > > > darn confusing. I'll see if I can improve them.
> > >
> > > --
> > > Stefano
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-21 13:51 ` David Gibson
@ 2026-05-21 15:18 ` Stefano Brivio
2026-05-22 1:29 ` David Gibson
0 siblings, 1 reply; 27+ messages in thread
From: Stefano Brivio @ 2026-05-21 15:18 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Paul Holzinger
On Thu, 21 May 2026 23:51:04 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Thu, May 21, 2026 at 09:15:13AM +0200, Stefano Brivio wrote:
> > On Thu, 21 May 2026 16:56:43 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote:
> > > > On Thu, 21 May 2026 12:03:33 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > > > > > On Wed, 20 May 2026 23:08:50 +1000
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >
> > > > > > > There are two ways we can tell one of our sockets has received a FIN. We
> > > > > > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > > > > > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > > > > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > > > > > some other close out logic is based on seeing an EOF.
> > > > > > >
> > > > > > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > > > > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > > > > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > > > > >
> > > > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > ---
> > > > > > > tcp_splice.c | 14 +++++---------
> > > > > > > 1 file changed, 5 insertions(+), 9 deletions(-)
> > > > > > >
> > > > > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > > > > index 8fbd490f..b45f0060 100644
> > > > > > > --- a/tcp_splice.c
> > > > > > > +++ b/tcp_splice.c
> > > > > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > > > > > > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > > > > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > > > > int never_read = 1;
> > > > > > > - int eof = 0;
> > > > > > >
> > > > > > > while (1) {
> > > > > > > ssize_t readlen, written;
> > > > > > > @@ -510,7 +509,7 @@ retry:
> > > > > > > flow_trace(conn, "%zi from read-side call", readlen);
> > > > > > >
> > > > > > > if (!readlen) {
> > > > > > > - eof = 1;
> > > > > > > + conn_event(conn, FIN_RCVD(fromsidei));
> > > > > >
> > > > > > I'm not sure if I really found a concrete issue with this, but it looks
> > > > > > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > > > > > mean that we infer we received a FIN, regardless of whether we're done
> > > > > > processing all data from that half of the connection.
> > > > > >
> > > > > > Now FIN_RCVD is only set if we actually processed all the data and we
> > > > > > hit the end of file.
> > > > >
> > > > > True. But the only place that tested FIN_RCVD was at the end of
> > > > > tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
> > > > > was the cause of bug202 - we had FIN_RCVD set, but we didn't process
> > > > > it and shutdown() on the other side, because we didn't have eof.
> > > >
> > > > That sounds like a good motivation to clean this up, just two concerns
> > > > below:
> > > >
> > > > > > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > > > > > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > > > > > and it would have returned 0 instead.
> > > > > >
> > > > > > Will we ever get our zero-sized "read" later? If not, we might have
> > > > > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > > > > > guarantees in that sense from splice().
> > > > >
> > > > > It's not really about guarantees from splice. I'm pretty sure this is
> > > > > ok, reasoning as follows.
> > > > >
> > > > > Consider all the exit points from the loop body:
> > > > > - Each return is a return -1, so we kill the connection anyway. They
> > > > > don't matter
> > > > > - Each continue, goto retry and the end of the body will do the read
> > > > > side splice() again, so get another chance to see the EOF
> > > > > - That leaves just the breaks
> > > > >
> > > > > Consider each break (there are three, since patch 2 of this series)
> > > > > if (written < 0) {
> > > > > if (!conn->pending[fromsidei])
> > > > > break;
> > > > >
> > > > > (1) The pipe is empty and the write-splice returned EAGAIN, so it
> > > > > didn't remove data from the pipe.
> > > >
> > > > You're assuming that !conn->pending[fromsidei] means that the pipe is
> > > > empty. From what we see of it, it is.
> > >
> > > It does mean the pipe is empty. Everything we put in, we've taken
> > > out. There cannot be anything in there.
> > >
> > > > What the kernel can do with it, though, is different. It might return
> > > > EAGAIN even if we think we should have space, because it's resizing it
> > > > under memory pressure or anything like that. Or it delays freeing up
> > > > space or accounting for whatever reason.
> > >
> > > Theoretically, I suppose. But !pending doesn't just mean the pipe is
> > > not full; it means it's completely empty. Not being able to
> > > put any bytes at all into an empty pipe would be *very* surprising.
> > > So much so that if it happened in practice, I suspect we wouldn't be
> > > safe not having epoll events on the pipe ends, so that we can be
> > > notified when it deigns to accept some data.
> >
> > We can get 512-byte pipes, I actually saw that happening in practice
> > with either:
>
> Sure.. so? We can still put some bytes into it if it's empty.
The difference between empty and full is pretty small in that case.
> > - people setting low values for ulimits
> >
> > - the user (or just pasta itself) having a lot of pipes open
> >
> > and if I recall correctly that's where I saw the case of a supposedly
> > empty pipe giving us EAGAIN. That was years ago though and I didn't
> > specifically fix that.
>
> I mean.. that sounds like a kernel bug.
fcntl(2) says, for F_SETPIPE_SZ:
Note that because of the way the pages of the pipe buffer are
employed when data is written to the pipe, the number of bytes that
can be written may be less than the nominal size, depending on
the size of the writes.
...which I kind of understand really.
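
A rough way to observe that effect is to fill a pipe with fixed-size
non-blocking writes and compare the total against F_GETPIPE_SZ. This is
a minimal sketch using write() instead of splice(), so it only
illustrates the man page note, not the splice path itself.
pipe_usable_bytes() is a hypothetical helper:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: fill the pipe behind @pipe_wr with non-blocking
 * writes of @chunk bytes each until the kernel refuses one, and return
 * how many bytes were accepted, or -1 on an unexpected error.
 */
static long pipe_usable_bytes(int pipe_wr, size_t chunk)
{
	char buf[4096] = { 0 };
	long total = 0;
	int fl;

	assert(chunk > 0 && chunk <= sizeof(buf));

	fl = fcntl(pipe_wr, F_GETFL);
	if (fl < 0 || fcntl(pipe_wr, F_SETFL, fl | O_NONBLOCK) < 0)
		return -1;

	while (1) {
		ssize_t n = write(pipe_wr, buf, chunk);

		if (n < 0)
			return errno == EAGAIN ? total : -1;
		total += n;
	}
}
```

Comparing the result for a few chunk sizes against
fcntl(fd, F_GETPIPE_SZ) gives an idea of how far the usable size can
fall below the nominal one.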
> If we do have to handle that
> case we'll need epoll events on the pipe ends, since none of the
> socket events we monitor will trigger when the pipe becomes writable.
Well, EPOLLOUT should do it.
> > We currently probe the size based on the value we can have for 32 pipes
> > (TCP_SPLICE_PIPE_POOL_SIZE). By making that 4096 or so you should get
> > rather small pipes.
> >
> > Things might already be broken with them, I haven't checked the
> > behaviour in a long while. I think 512 bytes was the lower bound I hit.
> >
> > > > So it would be nice to make this part robust to that. I thought setting
> > > > FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
> > > >
> > > > > Therefore, the pipe must have been
> > > > > empty before the write-splice. Which means the read-splice can't have
> > > > > blocked on a full pipe.
> > > > > conn_event(conn, OUT_WAIT(!fromsidei));
> > > > > break;
> > > > > }
> > > > >
> > > > > (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> > > > > must have blocked on the output socket. We've set OUT_WAIT(), so
> > > > > we'll get an EPOLLOUT at some point which will cause us to read-splice
> > > > > again, meaning we get another chance to see the EOF.
> > > >
> > > > ...later. But what if we don't get a zero-sized read *at all*? I'm not
> > > > sure if splice() guarantees we do get one if we reach end-of-file.
> > >
> > > > That's something valid and very well established for read() and recv(),
> > > > but splice() is a bit weird. The documentation says:
> > > >
> > > > A return value of 0 means end of input.
> > > >
> > > > but I wouldn't assume we'll *always* get at least one in case of EOF.
> > >
> > > What else could we plausibly get?
> >
> > -1 with EBADF, probably with EPOLLERR, because something timed out?
>
> EBADF makes no sense, the fds are still valid, even if they're at EOF.
I was thinking of the case where we hit EOF but don't notice right away,
only seconds or minutes later, when the socket is already closed.
> > But I guess you're right, as long as we're not in the EPOLLERR category
> > of things, we should consistently get 0, even if we read multiple
> > times.
> >
> > I had in mind some kernel behaviour where we get 0 once, and then -1
> > (EAGAIN?) because... go figure. But no, it can't happen.
>
> I think the logic should be ok as long as we see a 0 once, even if we
> get EAGAINs after that.
>
> Another way to look at this - if we had to monitor EPOLLRDHUP to get
> this right, splice() would be unusable from blocking / synchronous
> code, which I don't think is the case.
Right, yes, I'm fairly convinced at this point.
> > > > > [...]
> > > > > if (conn->events & FIN_RCVD(fromsidei))
> > > > > break;
> > > > > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> > > > >
> > > > > > The existing implementation distinguishes between end-of-file we hit in
> > > > > > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > > > > > That was actually intended.
> > > > >
> > > > > It might be intended, but I can't see that we did anything with that
> > > > > information.
> > > >
> > > > We always set FIN_RCVD on it. You're right, if we only checked that on
> > > > 'eof', that didn't solve much, but that wasn't necessarily intended. My
> > > > original intention was to make setting of FIN_RCVD (or whatever it was
> > > > originally) robust.
> > >
> > > Ok, well. I've spotted other changes to make in the vicinity that I
> > > think will make some of this easier to reason about anyway. So I'll
> > > consider your points as I rework this and other patches.
> > >
> > > > > That said the conditions on which we exit / retry this loop are pretty
> > > > > darn confusing. I'll see if I can improve them.
--
Stefano
* Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
2026-05-21 15:18 ` Stefano Brivio
@ 2026-05-22 1:29 ` David Gibson
0 siblings, 0 replies; 27+ messages in thread
From: David Gibson @ 2026-05-22 1:29 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Paul Holzinger
On Thu, May 21, 2026 at 05:18:16PM +0200, Stefano Brivio wrote:
> On Thu, 21 May 2026 23:51:04 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Thu, May 21, 2026 at 09:15:13AM +0200, Stefano Brivio wrote:
> > > On Thu, 21 May 2026 16:56:43 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote:
> > > > > On Thu, 21 May 2026 12:03:33 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >
> > > > > > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > > > > > > On Wed, 20 May 2026 23:08:50 +1000
> > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > >
> > > > > > > > There are two ways we can tell one of our sockets has received a FIN. We
> > > > > > > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > > > > > > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > > > > > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > > > > > > some other close out logic is based on seeing an EOF.
> > > > > > > >
> > > > > > > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > > > > > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > > > > > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > > > > > >
> > > > > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > > ---
> > > > > > > > tcp_splice.c | 14 +++++---------
> > > > > > > > 1 file changed, 5 insertions(+), 9 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > > > > > index 8fbd490f..b45f0060 100644
> > > > > > > > --- a/tcp_splice.c
> > > > > > > > +++ b/tcp_splice.c
> > > > > > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > > > > > > > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > > > > > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > > > > > int never_read = 1;
> > > > > > > > - int eof = 0;
> > > > > > > >
> > > > > > > > while (1) {
> > > > > > > > ssize_t readlen, written;
> > > > > > > > @@ -510,7 +509,7 @@ retry:
> > > > > > > > flow_trace(conn, "%zi from read-side call", readlen);
> > > > > > > >
> > > > > > > > if (!readlen) {
> > > > > > > > - eof = 1;
> > > > > > > > + conn_event(conn, FIN_RCVD(fromsidei));
> > > > > > >
> > > > > > > I'm not sure if I really found a concrete issue with this, but it looks
> > > > > > > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > > > > > > mean that we infer we received a FIN, regardless of whether we're done
> > > > > > > processing all data from that half of the connection.
> > > > > > >
> > > > > > > Now FIN_RCVD is only set if we actually processed all the data and we
> > > > > > > hit the end of file.
> > > > > >
> > > > > > True. But the only place that tested FIN_RCVD was at the end of
> > > > > > tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
> > > > > > > was the cause of bug 202 - we had FIN_RCVD set, but we didn't process
> > > > > > it and shutdown() on the other side, because we didn't have eof.
> > > > >
> > > > > That sounds like a good motivation to clean this up, just two concerns
> > > > > below:
> > > > >
> > > > > > > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > > > > > > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > > > > > > and it would have returned 0 instead.
> > > > > > >
> > > > > > > Will we ever get our zero-sized "read" later? If not, we might have
> > > > > > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > > > > > > guarantees in that sense from splice().
> > > > > >
> > > > > > It's not really about guarantees from splice. I'm pretty sure this is
> > > > > > ok, reasoning as follows.
> > > > > >
> > > > > > Consider all the exit points from the loop body:
> > > > > > - Each return is a return -1, so we kill the connection anyway. They
> > > > > > don't matter
> > > > > > - Each continue, goto retry and the end of the body will do the read
> > > > > > side splice() again, so get another chance to see the EOF
> > > > > > - That leaves just the breaks
> > > > > >
> > > > > > Consider each break (there are three, since patch 2 of this series)
> > > > > > if (written < 0) {
> > > > > > if (!conn->pending[fromsidei])
> > > > > > break;
> > > > > >
> > > > > > (1) The pipe is empty and the write-splice returned EAGAIN, so it
> > > > > > didn't remove data from the pipe.
> > > > >
> > > > > You're assuming that !conn->pending[fromsidei] means that the pipe is
> > > > > empty. From what we see of it, it is.
> > > >
> > > > It does mean the pipe is empty. Everything we put in, we've taken
> > > > out. There cannot be anything in there.
> > > >
> > > > > What the kernel can do with it, though, is different. It might return
> > > > > EAGAIN even if we think we should have space, because it's resizing it
> > > > > under memory pressure or anything like that. Or it delays freeing up
> > > > > space or accounting for whatever reason.
> > > >
> > > > Theoretically, I suppose. But !pending doesn't just mean the pipe is
> > > > not full, it means it's completely empty. Not being able to
> > > > put any bytes at all into an empty pipe would be *very* surprising.
> > > > So much so that if it happened in practice, I suspect we wouldn't be
> > > > safe not having epoll events on the pipe ends, so that we can be
> > > > notified when it deigns to accept some data.
> > >
> > > We can get 512-byte pipes, I actually saw that happening in practice
> > > with either:
> >
> > Sure.. so? We can still put some bytes into it if it's empty.
>
> The difference between empty and full is pretty small in that case.

Small, but not nothing. The empty pipe can still accept *some* bytes
- otherwise the pipe is useless. Accept any bytes and the reasoning
above works.

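
To make the "an empty pipe accepts *some* bytes" point concrete, here is a
minimal sketch. It is not passt code: first_write_to_empty_pipe() is a
hypothetical helper, and it only covers a freshly created pipe that we ask the
kernel to shrink with F_SETPIPE_SZ. It does not reproduce the low-ulimit or
many-open-pipes situations mentioned in this thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* pipe2(), F_SETPIPE_SZ */
#endif
#include <assert.h>		/* assert(), used in the usage example */
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not from passt: create a fresh pipe, ask the kernel to
 * resize it to 'requested' bytes, then attempt a single non-blocking 1-byte
 * write into it while it is still empty.
 *
 * Returns whatever write() returned: 1 if the byte was accepted, -1 otherwise.
 */
ssize_t first_write_to_empty_pipe(int requested)
{
	int p[2];
	ssize_t n;

	if (pipe2(p, O_NONBLOCK | O_CLOEXEC) < 0)
		return -1;

	/* Best effort only: the kernel may round the size up or refuse the
	 * request, and either outcome is fine for this experiment.
	 */
	(void)fcntl(p[1], F_SETPIPE_SZ, requested);

	n = write(p[1], "x", 1);

	close(p[0]);
	close(p[1]);
	return n;
}
```

A caller could check this with assert(first_write_to_empty_pipe(512) == 1).
If that ever failed for an empty pipe, the loop exit reasoning above would
need revisiting.
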
> > > - people setting low values for ulimits
> > >
> > > - the user (or just pasta itself) having a lot of pipes open
> > >
> > > and if I recall correctly that's where I saw the case of a supposedly
> > > empty pipe giving us EAGAIN. That was years ago though and I didn't
> > > specifically fix that.
> >
> > I mean.. that sounds like a kernel bug.
>
> fcntl(2) says, for F_SETPIPE_SZ:
>
> Note that because of the way the pages of the pipe buffer are
> employed when data is written to the pipe, the number of bytes that
> can be written may be less than the nominal size, depending on
> the size of the writes.
>
> ...which I kind of understand really.

Ok, but still if an empty pipe can't accept *any* bytes, it's useless,
which would make that a kernel bug.

> > If we do have to handle that
> > case we'll need epoll events on the pipe ends, since none of the
> > socket events we monitor will trigger when the pipe becomes writable.
>
> Well, EPOLLOUT should do it.

But EPOLLOUT on the pipe end itself, not just on the write side socket
as we have now.

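
For illustration, watching the pipe's write end itself could look roughly like
the sketch below. None of this is in passt today: fill_pipe(), drain_pipe()
and pipe_end_writable() are hypothetical names, and a real implementation
would register the pipe end in the existing epoll set rather than create a
throwaway one per query.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* pipe2() in the usage example */
#endif
#include <assert.h>		/* assert(), used in the usage example */
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Write to a non-blocking pipe end until the kernel reports EAGAIN.
 * Returns 0 once the pipe is full, -1 on any other error.
 */
int fill_pipe(int wfd)
{
	char buf[4096];

	memset(buf, 0, sizeof(buf));
	for (;;) {
		if (write(wfd, buf, sizeof(buf)) < 0)
			return errno == EAGAIN ? 0 : -1;
	}
}

/* Read from a non-blocking pipe end until it is empty (or at EOF).
 * Returns 0 on success, -1 on any error other than EAGAIN.
 */
int drain_pipe(int rfd)
{
	char buf[4096];

	for (;;) {
		ssize_t n = read(rfd, buf, sizeof(buf));

		if (n < 0)
			return errno == EAGAIN ? 0 : -1;
		if (n == 0)
			return 0;
	}
}

/* Ask epoll whether EPOLLOUT is currently pending on the pipe write end.
 * Returns 1 if writable, 0 if not, -1 on error.
 */
int pipe_end_writable(int wfd)
{
	struct epoll_event ev = { .events = EPOLLOUT }, out;
	int epfd, n;

	epfd = epoll_create1(EPOLL_CLOEXEC);
	if (epfd < 0)
		return -1;

	if (epoll_ctl(epfd, EPOLL_CTL_ADD, wfd, &ev) < 0) {
		close(epfd);
		return -1;
	}

	n = epoll_wait(epfd, &out, 1, 0);	/* timeout 0: just poll */
	close(epfd);

	if (n < 0)
		return -1;
	return n == 1 && (out.events & EPOLLOUT);
}
```

With a pipe from pipe2(p, O_NONBLOCK), pipe_end_writable(p[1]) should report
1 while the pipe is empty, 0 after fill_pipe(p[1]), and 1 again after
drain_pipe(p[0]). That transition is the notification we would need, and none
of the socket events can provide it.
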
> > > We currently probe the size based on the value we can have for 32 pipes
> > > (TCP_SPLICE_PIPE_POOL_SIZE). By making that 4096 or so you should get
> > > rather small pipes.
> > >
> > > Things might already be broken with them, I haven't checked the
> > > behaviour in a long while. I think 512 bytes was the lower bound I hit.
> > >
> > > > > So it would be nice to make this part robust to that. I thought setting
> > > > > FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
> > > > >
> > > > > > Therefore, the pipe must have been
> > > > > > empty before the write-splice. Which means the read-splice can't have
> > > > > > blocked on a full pipe.
> > > > > > conn_event(conn, OUT_WAIT(!fromsidei));
> > > > > > break;
> > > > > > }
> > > > > >
> > > > > > (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> > > > > > must have blocked on the output socket. We've set OUT_WAIT(), so
> > > > > > we'll get an EPOLLOUT at some point which will cause us to read-splice
> > > > > > again, meaning we get another chance to see the EOF.
> > > > >
> > > > > ...later. But what if we don't get a zero-sized read *at all*? I'm not
> > > > > sure if splice() guarantees we do get one if we reach end-of-file.
> > > >
> > > > > That's something valid and very well established for read() and recv(),
> > > > > but splice() is a bit weird. The documentation says:
> > > > >
> > > > > A return value of 0 means end of input.
> > > > >
> > > > > but I wouldn't assume we'll *always* get at least one in case of EOF.
> > > >
> > > > What else could we plausibly get?
> > >
> > > -1 with EBADF, probably with EPOLLERR, because something timed out?
> >
> > EBADF makes no sense, the fds are still valid, even if they're at EOF.
>
> I was thinking we hit EOF, don't notice right away, but after seconds /
> minutes and the socket is already closed.

The only place we close the socket is in the flow close path, at which
point we've already decided it's over so the whole question is moot.

> > > But I guess you're right, as long as we're not in the EPOLLERR category
> > > of things, we should consistently get 0, even if we read multiple
> > > times.
> > >
> > > I had in mind some kernel behaviour where we get 0 once, and then -1
> > > (EAGAIN?) because... go figure. But no, it can't happen.
> >
> > I think the logic should be ok as long as we see a 0 once, even if we
> > get EAGAINs after that.
> >
> > Another way to look at this - if we had to monitor EPOLLRDHUP to get
> > this right, splice() would be unusable from blocking / synchronous
> > code, which I don't think is the case.
>
> Right, yes, I'm fairly convinced at this point.

Ok :).

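
As a sanity check of the "we consistently get 0" conclusion, a sketch along
these lines can be used. Assumptions: it uses an AF_UNIX stream socketpair
rather than a TCP connection to stay short, so it says nothing definitive
about TCP, and send_then_fin() and splice_once() are hypothetical helpers, not
functions from tcp_splice.c.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* splice(), pipe2() */
#endif
#include <assert.h>		/* assert(), used in the usage example */
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Queue 'len' bytes on 'sock', then shut down its write side so that the
 * peer sees end-of-file once the queued data has been consumed.
 * Returns 0 on success, -1 on error.
 */
int send_then_fin(int sock, const void *data, size_t len)
{
	if (len && write(sock, data, len) != (ssize_t)len)
		return -1;

	return shutdown(sock, SHUT_WR);
}

/* One non-blocking read-side splice() from a socket into a pipe.
 * Returns the splice() result: bytes moved, 0 at end of input, or -1.
 */
ssize_t splice_once(int sock, int pipe_w)
{
	return splice(sock, NULL, pipe_w, NULL, 65536,
		      SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
}
```

After send_then_fin(sv[0], "hello", 5) on a socketpair sv[], the first
splice_once(sv[1], pipe_w) should move the 5 queued bytes, and each later
call should return 0 rather than -1 with EAGAIN.
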
> > > > > > [...]
> > > > > > if (conn->events & FIN_RCVD(fromsidei))
> > > > > > break;
> > > > > > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> > > > > >
> > > > > > > The existing implementation distinguishes between end-of-file we hit in
> > > > > > > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > > > > > > That was actually intended.
> > > > > >
> > > > > > It might be intended, but I can't see that we did anything with that
> > > > > > information.
> > > > >
> > > > > We always set FIN_RCVD on it. You're right, if we only checked that on
> > > > > 'eof', that didn't solve much, but that wasn't necessarily intended. My
> > > > > original intention was to make setting of FIN_RCVD (or whatever it was
> > > > > originally) robust.
> > > >
> > > > Ok, well. I've spotted other changes to make in the vicinity that I
> > > > think will make some of this easier to reason about anyway. So I'll
> > > > consider your points as I rework this and other patches.
> > > >
> > > > > > That said the conditions on which we exit / retry this loop are pretty
> > > > > > darn confusing. I'll see if I can improve them.
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
end of thread, other threads:[~2026-05-22 1:30 UTC | newest]
Thread overview: 27+ messages
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
2026-05-20 13:08 ` [PATCH 1/6] tcp_splice: Improve error reporting David Gibson
2026-05-20 14:31 ` Stefano Brivio
2026-05-21 0:43 ` David Gibson
2026-05-21 5:08 ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding David Gibson
2026-05-20 20:28 ` Stefano Brivio
2026-05-21 0:46 ` David Gibson
2026-05-20 13:08 ` [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding David Gibson
2026-05-20 20:28 ` Stefano Brivio
2026-05-21 0:50 ` David Gibson
2026-05-20 13:08 ` [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes David Gibson
2026-05-20 20:29 ` Stefano Brivio
2026-05-21 0:54 ` David Gibson
2026-05-20 13:08 ` [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling David Gibson
2026-05-20 20:30 ` Stefano Brivio
2026-05-21 2:03 ` David Gibson
2026-05-21 5:40 ` Stefano Brivio
2026-05-21 6:56 ` David Gibson
2026-05-21 7:15 ` Stefano Brivio
2026-05-21 13:51 ` David Gibson
2026-05-21 15:18 ` Stefano Brivio
2026-05-22 1:29 ` David Gibson
2026-05-20 13:08 ` [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling David Gibson
2026-05-20 20:30 ` Stefano Brivio
2026-05-21 2:11 ` David Gibson
2026-05-21 5:40 ` Stefano Brivio