* [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels
@ 2025-09-09 18:16 Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 1/8] tcp: FIN flags have to be retransmitted as well Stefano Brivio
` (8 more replies)
0 siblings, 9 replies; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger
sk_rcvbuf checks"), window limits are enforced more aggressively with
a bigger amount of zero-window updates compared to what happened with
e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone,
and occasional duplicate ACKs can now be seen also for local transfers
with default (208 KiB) socket buffer sizes.
Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the
pasta integration occasionally fail on the "TCP/IPv4 large transfer,
tap" case.
While playing with a reproducer that seems to be matching those
failures:
while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
and a kernel including that commit, I hit a few different failures,
that should be fixed by this series.
Paul tested v1 of this series and found an additional failure
(transfer timeout), which I could reproduce with a slightly different
command:
while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
where write.sh is simply:
#!/bin/sh
cat > /tmp/large.rcv
so that the connection is not half-closed starting from the beginning,
because socat can't make assumptions about the unidirectional nature
of the traffic. This should now be fixed as well by the new version of
patch 3/8.
v4:
- add patch 8/8
v3:
- add patch 6/7
- in 7/7, check dlen <= 1 for keep-alive segments, instead of len <= 1
v2: in 3/6, rewind sequence also if the zero-window update comes in
the middle of a batch with non-zero window updates
Stefano Brivio (8):
tcp: FIN flags have to be retransmitted as well
tcp: Factor sequence rewind for retransmissions into a new function
tcp: Rewind sequence when guest shrinks window to zero
tcp: Fix closing logic for half-closed connections
tcp: Don't try to transmit right after the peer shrank the window to
zero
tcp: Cast operands of sequence comparison macros to uint32_t before
using them
tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path
consistent
tcp: Don't send FIN segment to guest yet if we have pending
unacknowledged data
tcp.c | 142 ++++++++++++++++++++++++++++++++++++++-----------
tcp_buf.c | 5 +-
tcp_internal.h | 12 +++--
3 files changed, 122 insertions(+), 37 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 1/8] tcp: FIN flags have to be retransmitted as well
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
@ 2025-09-09 18:16 ` Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 2/8] tcp: Factor sequence rewind for retransmissions into a new function Stefano Brivio
` (7 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
If we're retransmitting any data, and we sent a FIN segment to our
peer, regardless of whether it was received, we obviously have to
retransmit it as well, given that it can only come with the last data
segment, or after it.
Unconditionally clear the internal TAP_FIN_SENT flag whenever we
re-transmit, so that we know we have to send it again, in case.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tcp.c b/tcp.c
index 253cdb3..9b7a9b0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1778,6 +1778,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
"fast re-transmit, ACK: %u, previous sequence: %u",
max_ack_seq, conn->seq_to_tap);
conn->seq_to_tap = max_ack_seq;
+ conn->events &= ~TAP_FIN_SENT;
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
@@ -2312,6 +2313,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
flow_dbg(conn, "ACK timeout, retry");
conn->retrans++;
conn->seq_to_tap = conn->seq_ack_from_tap;
+ conn->events &= ~TAP_FIN_SENT;
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
if (tcp_set_peek_offset(conn, 0)) {
--
@@ -1778,6 +1778,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
"fast re-transmit, ACK: %u, previous sequence: %u",
max_ack_seq, conn->seq_to_tap);
conn->seq_to_tap = max_ack_seq;
+ conn->events &= ~TAP_FIN_SENT;
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
@@ -2312,6 +2313,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
flow_dbg(conn, "ACK timeout, retry");
conn->retrans++;
conn->seq_to_tap = conn->seq_ack_from_tap;
+ conn->events &= ~TAP_FIN_SENT;
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
if (tcp_set_peek_offset(conn, 0)) {
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v4 2/8] tcp: Factor sequence rewind for retransmissions into a new function
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 1/8] tcp: FIN flags have to be retransmitted as well Stefano Brivio
@ 2025-09-09 18:16 ` Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
` (6 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
...as I'm going to need a third occurrence of this in the next change.
This introduces a small functional change in tcp_data_from_tap(): the
sequence was previously rewound to the highest ACK number we found in
the current packet batch, and not to the current value of
seq_ack_from_tap.
The two might differ in case tcp_sock_consume() failed, because in
that case we're ignoring that ACK altogether. But if we're ignoring
it, it looks more correct to me to start retransmitting from an
earlier sequence anyway.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 47 ++++++++++++++++++++++++++++++++---------------
1 file changed, 32 insertions(+), 15 deletions(-)
diff --git a/tcp.c b/tcp.c
index 9b7a9b0..86e08f1 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1107,6 +1107,26 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
}
}
+/**
+ * tcp_rewind_seq() - Rewind sequence to tap and socket offset to current ACK
+ * @c: Execution context
+ * @conn: Connection pointer
+ *
+ * Return: 0 on success, -1 on failure, with connection reset
+ */
+static int tcp_rewind_seq(const struct ctx *c, struct tcp_tap_conn *conn)
+{
+ conn->seq_to_tap = conn->seq_ack_from_tap;
+ conn->events &= ~TAP_FIN_SENT;
+
+ if (tcp_set_peek_offset(conn, 0)) {
+ tcp_rst(c, conn);
+ return -1;
+ }
+
+ return 0;
+}
+
/**
* tcp_prepare_flags() - Prepare header for flags-only segment (no payload)
* @c: Execution context
@@ -1776,13 +1796,11 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (retr) {
flow_trace(conn,
"fast re-transmit, ACK: %u, previous sequence: %u",
- max_ack_seq, conn->seq_to_tap);
- conn->seq_to_tap = max_ack_seq;
- conn->events &= ~TAP_FIN_SENT;
- if (tcp_set_peek_offset(conn, 0)) {
- tcp_rst(c, conn);
+ conn->seq_ack_from_tap, conn->seq_to_tap);
+
+ if (tcp_rewind_seq(c, conn))
return -1;
- }
+
tcp_data_from_sock(c, conn);
}
@@ -2311,17 +2329,16 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
tcp_rst(c, conn);
} else {
flow_dbg(conn, "ACK timeout, retry");
- conn->retrans++;
- conn->seq_to_tap = conn->seq_ack_from_tap;
- conn->events &= ~TAP_FIN_SENT;
+
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
- if (tcp_set_peek_offset(conn, 0)) {
- tcp_rst(c, conn);
- } else {
- tcp_data_from_sock(c, conn);
- tcp_timer_ctl(c, conn);
- }
+
+ conn->retrans++;
+ if (tcp_rewind_seq(c, conn))
+ return;
+
+ tcp_data_from_sock(c, conn);
+ tcp_timer_ctl(c, conn);
}
} else {
struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
--
@@ -1107,6 +1107,26 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
}
}
+/**
+ * tcp_rewind_seq() - Rewind sequence to tap and socket offset to current ACK
+ * @c: Execution context
+ * @conn: Connection pointer
+ *
+ * Return: 0 on success, -1 on failure, with connection reset
+ */
+static int tcp_rewind_seq(const struct ctx *c, struct tcp_tap_conn *conn)
+{
+ conn->seq_to_tap = conn->seq_ack_from_tap;
+ conn->events &= ~TAP_FIN_SENT;
+
+ if (tcp_set_peek_offset(conn, 0)) {
+ tcp_rst(c, conn);
+ return -1;
+ }
+
+ return 0;
+}
+
/**
* tcp_prepare_flags() - Prepare header for flags-only segment (no payload)
* @c: Execution context
@@ -1776,13 +1796,11 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (retr) {
flow_trace(conn,
"fast re-transmit, ACK: %u, previous sequence: %u",
- max_ack_seq, conn->seq_to_tap);
- conn->seq_to_tap = max_ack_seq;
- conn->events &= ~TAP_FIN_SENT;
- if (tcp_set_peek_offset(conn, 0)) {
- tcp_rst(c, conn);
+ conn->seq_ack_from_tap, conn->seq_to_tap);
+
+ if (tcp_rewind_seq(c, conn))
return -1;
- }
+
tcp_data_from_sock(c, conn);
}
@@ -2311,17 +2329,16 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
tcp_rst(c, conn);
} else {
flow_dbg(conn, "ACK timeout, retry");
- conn->retrans++;
- conn->seq_to_tap = conn->seq_ack_from_tap;
- conn->events &= ~TAP_FIN_SENT;
+
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
- if (tcp_set_peek_offset(conn, 0)) {
- tcp_rst(c, conn);
- } else {
- tcp_data_from_sock(c, conn);
- tcp_timer_ctl(c, conn);
- }
+
+ conn->retrans++;
+ if (tcp_rewind_seq(c, conn))
+ return;
+
+ tcp_data_from_sock(c, conn);
+ tcp_timer_ctl(c, conn);
}
} else {
struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 1/8] tcp: FIN flags have to be retransmitted as well Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 2/8] tcp: Factor sequence rewind for retransmissions into a new function Stefano Brivio
@ 2025-09-09 18:16 ` Stefano Brivio
2025-09-10 2:20 ` David Gibson
2025-09-09 18:16 ` [PATCH v4 4/8] tcp: Fix closing logic for half-closed connections Stefano Brivio
` (5 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
A window shrunk to zero means by definition that anything else that
might be in flight is now out of window. Restart from the currently
acknowledged sequence.
We need to do that both in tcp_tap_window_update(), where we already
check for zero-window updates, as well as in tcp_data_from_tap(),
because we might get one of those updates in a batch of packets that
also contains a non-zero window update.
Suggested-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/tcp.c b/tcp.c
index 86e08f1..12d42e0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1268,19 +1268,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
/**
* tcp_tap_window_update() - Process an updated window from tap side
+ * @c: Execution context
* @conn: Connection pointer
* @wnd: Window value, host order, unscaled
*/
-static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
+static void tcp_tap_window_update(const struct ctx *c,
+ struct tcp_tap_conn *conn, unsigned wnd)
{
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
/* Work-around for bug introduced in peer kernel code, commit
- * e2142825c120 ("net: tcp: send zero-window ACK when no memory").
- * We don't update if window shrank to zero.
+ * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't
+ * update the window if it shrank to zero, so that we'll eventually
+ * retry to send data, but rewind the sequence as that obviously implies
+ * that no data beyond the updated window will ever be acknowledged.
*/
- if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap))
+ if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
+ tcp_rewind_seq(c, conn);
return;
+ }
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
@@ -1709,7 +1715,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
tcp_timer_ctl(c, conn);
if (p->count == 1) {
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn,
+ ntohs(th->window));
return 1;
}
@@ -1728,6 +1735,15 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
ack_seq == max_ack_seq &&
ntohs(th->window) == max_ack_seq_wnd;
+ /* See tcp_tap_window_update() for details. On
+ * top of that, we also need to check here if a
+ * zero-window update is contained in a batch of
+ * packets that includes a non-zero window as
+ * well.
+ */
+ if (!ntohs(th->window))
+ tcp_rewind_seq(c, conn);
+
max_ack_seq_wnd = ntohs(th->window);
max_ack_seq = ack_seq;
}
@@ -1791,7 +1807,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (ack && !tcp_sock_consume(conn, max_ack_seq))
tcp_update_seqack_from_tap(c, conn, max_ack_seq);
- tcp_tap_window_update(conn, max_ack_seq_wnd);
+ tcp_tap_window_update(c, conn, max_ack_seq_wnd);
if (retr) {
flow_trace(conn,
@@ -1880,7 +1896,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
const struct tcphdr *th,
const char *opts, size_t optlen)
{
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_get_tap_ws(conn, opts, optlen);
/* First value is not scaled */
@@ -2085,7 +2101,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (!th->ack)
goto reset;
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
@@ -2097,7 +2113,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (conn->events & TAP_FIN_RCVD) {
tcp_sock_consume(conn, ntohl(th->ack_seq));
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
if (conn->events & SOCK_FIN_RCVD &&
--
@@ -1268,19 +1268,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
/**
* tcp_tap_window_update() - Process an updated window from tap side
+ * @c: Execution context
* @conn: Connection pointer
* @wnd: Window value, host order, unscaled
*/
-static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
+static void tcp_tap_window_update(const struct ctx *c,
+ struct tcp_tap_conn *conn, unsigned wnd)
{
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
/* Work-around for bug introduced in peer kernel code, commit
- * e2142825c120 ("net: tcp: send zero-window ACK when no memory").
- * We don't update if window shrank to zero.
+ * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't
+ * update the window if it shrank to zero, so that we'll eventually
+ * retry to send data, but rewind the sequence as that obviously implies
+ * that no data beyond the updated window will ever be acknowledged.
*/
- if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap))
+ if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
+ tcp_rewind_seq(c, conn);
return;
+ }
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
@@ -1709,7 +1715,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
tcp_timer_ctl(c, conn);
if (p->count == 1) {
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn,
+ ntohs(th->window));
return 1;
}
@@ -1728,6 +1735,15 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
ack_seq == max_ack_seq &&
ntohs(th->window) == max_ack_seq_wnd;
+ /* See tcp_tap_window_update() for details. On
+ * top of that, we also need to check here if a
+ * zero-window update is contained in a batch of
+ * packets that includes a non-zero window as
+ * well.
+ */
+ if (!ntohs(th->window))
+ tcp_rewind_seq(c, conn);
+
max_ack_seq_wnd = ntohs(th->window);
max_ack_seq = ack_seq;
}
@@ -1791,7 +1807,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (ack && !tcp_sock_consume(conn, max_ack_seq))
tcp_update_seqack_from_tap(c, conn, max_ack_seq);
- tcp_tap_window_update(conn, max_ack_seq_wnd);
+ tcp_tap_window_update(c, conn, max_ack_seq_wnd);
if (retr) {
flow_trace(conn,
@@ -1880,7 +1896,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
const struct tcphdr *th,
const char *opts, size_t optlen)
{
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_get_tap_ws(conn, opts, optlen);
/* First value is not scaled */
@@ -2085,7 +2101,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (!th->ack)
goto reset;
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
@@ -2097,7 +2113,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (conn->events & TAP_FIN_RCVD) {
tcp_sock_consume(conn, ntohl(th->ack_seq));
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
if (conn->events & SOCK_FIN_RCVD &&
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v4 4/8] tcp: Fix closing logic for half-closed connections
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (2 preceding siblings ...)
2025-09-09 18:16 ` [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
@ 2025-09-09 18:16 ` Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 5/8] tcp: Don't try to transmit right after the peer shrank the window to zero Stefano Brivio
` (4 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
First off, don't close connections half-closed by the guest before
our own FIN is acknowledged by the guest itself.
That is, after we receive a FIN from the guest (TAP_FIN_RCVD), if we
don't have any data left to send from the socket (SOCK_FIN_RCVD, or
EPOLLHUP), we send a FIN segment to the guest (TAP_FIN_SENT), but we
need to actually have it acknowledged (and have no pending
retransmissions) before we can close the connection: check for
TAP_FIN_ACKED, first.
Then, if we set TAP_FIN_SENT, and we receive an ACK segment from the
guest, set TAP_FIN_ACKED. This was entirely missing for the
TAP_FIN_RCVD case, and as we fix the problem described above, this
becomes relevant as well.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/tcp.c b/tcp.c
index 12d42e0..b83510b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2116,9 +2116,14 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
- if (conn->events & SOCK_FIN_RCVD &&
- conn->seq_ack_from_tap == conn->seq_to_tap)
- conn_event(c, conn, CLOSED);
+ if (conn->seq_ack_from_tap == conn->seq_to_tap) {
+ if (th->ack && conn->events & TAP_FIN_SENT)
+ conn_event(c, conn, TAP_FIN_ACKED);
+
+ if (conn->events & SOCK_FIN_RCVD &&
+ conn->events & TAP_FIN_ACKED)
+ conn_event(c, conn, CLOSED);
+ }
return 1;
}
@@ -2398,7 +2403,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
return;
}
- if ((conn->events & TAP_FIN_SENT) && (events & EPOLLHUP)) {
+ if ((conn->events & TAP_FIN_ACKED) && (events & EPOLLHUP)) {
conn_event(c, conn, CLOSED);
return;
}
--
@@ -2116,9 +2116,14 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
- if (conn->events & SOCK_FIN_RCVD &&
- conn->seq_ack_from_tap == conn->seq_to_tap)
- conn_event(c, conn, CLOSED);
+ if (conn->seq_ack_from_tap == conn->seq_to_tap) {
+ if (th->ack && conn->events & TAP_FIN_SENT)
+ conn_event(c, conn, TAP_FIN_ACKED);
+
+ if (conn->events & SOCK_FIN_RCVD &&
+ conn->events & TAP_FIN_ACKED)
+ conn_event(c, conn, CLOSED);
+ }
return 1;
}
@@ -2398,7 +2403,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
return;
}
- if ((conn->events & TAP_FIN_SENT) && (events & EPOLLHUP)) {
+ if ((conn->events & TAP_FIN_ACKED) && (events & EPOLLHUP)) {
conn_event(c, conn, CLOSED);
return;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v4 5/8] tcp: Don't try to transmit right after the peer shrank the window to zero
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (3 preceding siblings ...)
2025-09-09 18:16 ` [PATCH v4 4/8] tcp: Fix closing logic for half-closed connections Stefano Brivio
@ 2025-09-09 18:16 ` Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 6/8] tcp: Cast operands of sequence comparison macros to uint32_t before using them Stefano Brivio
` (3 subsequent siblings)
8 siblings, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
If the peer shrinks the window to zero, we'll skip storing the new
window, as a convenient way to cause window probes (which exceed any
zero-sized window, strictly speaking) if we don't get window updates
in a while.
As we do so, though, we need to ensure we don't try to queue more data
from the socket right after we process this window update, as the
entire point of a zero-window advertisement is to keep us from sending
more data.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/tcp.c b/tcp.c
index b83510b..9c70a25 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1271,8 +1271,10 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
* @c: Execution context
* @conn: Connection pointer
* @wnd: Window value, host order, unscaled
+ *
+ * Return: false on zero window (not stored to wnd_from_tap), true otherwise
*/
-static void tcp_tap_window_update(const struct ctx *c,
+static bool tcp_tap_window_update(const struct ctx *c,
struct tcp_tap_conn *conn, unsigned wnd)
{
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
@@ -1285,13 +1287,14 @@ static void tcp_tap_window_update(const struct ctx *c,
*/
if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
tcp_rewind_seq(c, conn);
- return;
+ return false;
}
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
/* FIXME: reflect the tap-side receiver's window back to the sock-side
* sender by adjusting SO_RCVBUF? */
+ return true;
}
/**
@@ -2101,9 +2104,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (!th->ack)
goto reset;
- tcp_tap_window_update(c, conn, ntohs(th->window));
-
- tcp_data_from_sock(c, conn);
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ tcp_data_from_sock(c, conn);
if (p->count - idx == 1)
return 1;
@@ -2113,8 +2115,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (conn->events & TAP_FIN_RCVD) {
tcp_sock_consume(conn, ntohl(th->ack_seq));
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- tcp_tap_window_update(c, conn, ntohs(th->window));
- tcp_data_from_sock(c, conn);
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ tcp_data_from_sock(c, conn);
if (conn->seq_ack_from_tap == conn->seq_to_tap) {
if (th->ack && conn->events & TAP_FIN_SENT)
--
@@ -1271,8 +1271,10 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
* @c: Execution context
* @conn: Connection pointer
* @wnd: Window value, host order, unscaled
+ *
+ * Return: false on zero window (not stored to wnd_from_tap), true otherwise
*/
-static void tcp_tap_window_update(const struct ctx *c,
+static bool tcp_tap_window_update(const struct ctx *c,
struct tcp_tap_conn *conn, unsigned wnd)
{
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
@@ -1285,13 +1287,14 @@ static void tcp_tap_window_update(const struct ctx *c,
*/
if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
tcp_rewind_seq(c, conn);
- return;
+ return false;
}
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
/* FIXME: reflect the tap-side receiver's window back to the sock-side
* sender by adjusting SO_RCVBUF? */
+ return true;
}
/**
@@ -2101,9 +2104,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (!th->ack)
goto reset;
- tcp_tap_window_update(c, conn, ntohs(th->window));
-
- tcp_data_from_sock(c, conn);
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ tcp_data_from_sock(c, conn);
if (p->count - idx == 1)
return 1;
@@ -2113,8 +2115,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (conn->events & TAP_FIN_RCVD) {
tcp_sock_consume(conn, ntohl(th->ack_seq));
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- tcp_tap_window_update(c, conn, ntohs(th->window));
- tcp_data_from_sock(c, conn);
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ tcp_data_from_sock(c, conn);
if (conn->seq_ack_from_tap == conn->seq_to_tap) {
if (th->ack && conn->events & TAP_FIN_SENT)
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v4 6/8] tcp: Cast operands of sequence comparison macros to uint32_t before using them
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (4 preceding siblings ...)
2025-09-09 18:16 ` [PATCH v4 5/8] tcp: Don't try to transmit right after the peer shrank the window to zero Stefano Brivio
@ 2025-09-09 18:16 ` Stefano Brivio
2025-09-10 2:21 ` David Gibson
2025-09-09 18:16 ` [PATCH v4 7/8] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
` (2 subsequent siblings)
8 siblings, 1 reply; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
Otherwise, passing signed types causes automatic promotion of the
result of the subtractions as well, which is not what we want, as
these macros rely on unsigned 32-bit arithmetic.
The next patch introduces a ssize_t operand for SEQ_LE, illustrating
the issue.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp_internal.h | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/tcp_internal.h b/tcp_internal.h
index d0009f8..ce6fee2 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -21,10 +21,14 @@
sizeof(struct ipv6hdr), \
sizeof(uint32_t))
-#define SEQ_LE(a, b) ((b) - (a) < MAX_WINDOW)
-#define SEQ_LT(a, b) ((b) - (a) - 1 < MAX_WINDOW)
-#define SEQ_GE(a, b) ((a) - (b) < MAX_WINDOW)
-#define SEQ_GT(a, b) ((a) - (b) - 1 < MAX_WINDOW)
+#define SEQ_LE(a, b) \
+ ((uint32_t)(b) - (uint32_t)(a) < MAX_WINDOW)
+#define SEQ_LT(a, b) \
+ ((uint32_t)(b) - (uint32_t)(a) - 1 < MAX_WINDOW)
+#define SEQ_GE(a, b) \
+ ((uint32_t)(a) - (uint32_t)(b) < MAX_WINDOW)
+#define SEQ_GT(a, b) \
+ ((uint32_t)(a) - (uint32_t)(b) - 1 < MAX_WINDOW)
#define FIN (1 << 0)
#define SYN (1 << 1)
--
@@ -21,10 +21,14 @@
sizeof(struct ipv6hdr), \
sizeof(uint32_t))
-#define SEQ_LE(a, b) ((b) - (a) < MAX_WINDOW)
-#define SEQ_LT(a, b) ((b) - (a) - 1 < MAX_WINDOW)
-#define SEQ_GE(a, b) ((a) - (b) < MAX_WINDOW)
-#define SEQ_GT(a, b) ((a) - (b) - 1 < MAX_WINDOW)
+#define SEQ_LE(a, b) \
+ ((uint32_t)(b) - (uint32_t)(a) < MAX_WINDOW)
+#define SEQ_LT(a, b) \
+ ((uint32_t)(b) - (uint32_t)(a) - 1 < MAX_WINDOW)
+#define SEQ_GE(a, b) \
+ ((uint32_t)(a) - (uint32_t)(b) < MAX_WINDOW)
+#define SEQ_GT(a, b) \
+ ((uint32_t)(a) - (uint32_t)(b) - 1 < MAX_WINDOW)
#define FIN (1 << 0)
#define SYN (1 << 1)
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v4 7/8] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (5 preceding siblings ...)
2025-09-09 18:16 ` [PATCH v4 6/8] tcp: Cast operands of sequence comparison macros to uint32_t before using them Stefano Brivio
@ 2025-09-09 18:16 ` Stefano Brivio
2025-09-10 2:27 ` David Gibson
2025-09-09 18:16 ` [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data Stefano Brivio
2025-09-10 9:10 ` [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Paul Holzinger
8 siblings, 1 reply; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
We currently have a number of discrepancies in the tcp_tap_handler()
path between the half-closed connection path and the regular one, and
they are mostly a result of code duplication, which comes in turn from
the fact that tcp_data_from_tap() deals with data transfers as well as
general connection bookkeeping, so we can't use it for half-closed
connections.
This suggests that we should probably rework it into two or more
functions, in the long term, but for the moment being I'm just fixing
one obvious issue, which is the lack of fast retransmissions in the
TAP_FIN_RCVD path, and a potential one, which is the fact we don't
handle socket flush failures.
Add fast re-transmit for half-closed connections, and handle the case
of socket flush (tcp_sock_consume()) flush failure in the same way as
tcp_data_from_tap() handles it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 42 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 39 insertions(+), 3 deletions(-)
diff --git a/tcp.c b/tcp.c
index 9c70a25..5163dbf 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1652,6 +1652,23 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
return tcp_buf_data_from_sock(c, conn);
}
+/**
+ * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet
+ * @th: Pointer to TCP header
+ * @l4len: TCP packet length, including TCP header
+ *
+ * Return: data length of TCP packet, -1 on invalid value of Data Offset field
+ */
+static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len)
+{
+ size_t off = th->doff * 4UL;
+
+ if (off < sizeof(*th) || off > l4len)
+ return -1;
+
+ return l4len - off;
+}
+
/**
* tcp_data_from_tap() - tap/guest data for established connection
* @c: Execution context
@@ -2113,9 +2130,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
/* Established connections not accepting data from tap */
if (conn->events & TAP_FIN_RCVD) {
- tcp_sock_consume(conn, ntohl(th->ack_seq));
- tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ bool retr;
+
+ retr = th->ack && !tcp_packet_data_len(th, l4len) && !th->fin &&
+ ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
+ ntohs(th->window) == conn->wnd_from_tap;
+
+ /* On socket flush failure, pretend there was no ACK, try again
+ * later
+ */
+ if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq)))
+ tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
+
+ if (retr) {
+ flow_trace(conn,
+ "fast re-transmit, ACK: %u, previous sequence: %u",
+ ntohl(th->ack_seq), conn->seq_to_tap);
+
+ if (tcp_rewind_seq(c, conn))
+ return -1;
+ }
+
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr)
tcp_data_from_sock(c, conn);
if (conn->seq_ack_from_tap == conn->seq_to_tap) {
--
@@ -1652,6 +1652,23 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
return tcp_buf_data_from_sock(c, conn);
}
+/**
+ * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet
+ * @th: Pointer to TCP header
+ * @l4len: TCP packet length, including TCP header
+ *
+ * Return: data length of TCP packet, -1 on invalid value of Data Offset field
+ */
+static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len)
+{
+ size_t off = th->doff * 4UL;
+
+ if (off < sizeof(*th) || off > l4len)
+ return -1;
+
+ return l4len - off;
+}
+
/**
* tcp_data_from_tap() - tap/guest data for established connection
* @c: Execution context
@@ -2113,9 +2130,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
/* Established connections not accepting data from tap */
if (conn->events & TAP_FIN_RCVD) {
- tcp_sock_consume(conn, ntohl(th->ack_seq));
- tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ bool retr;
+
+ retr = th->ack && !tcp_packet_data_len(th, l4len) && !th->fin &&
+ ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
+ ntohs(th->window) == conn->wnd_from_tap;
+
+ /* On socket flush failure, pretend there was no ACK, try again
+ * later
+ */
+ if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq)))
+ tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
+
+ if (retr) {
+ flow_trace(conn,
+ "fast re-transmit, ACK: %u, previous sequence: %u",
+ ntohl(th->ack_seq), conn->seq_to_tap);
+
+ if (tcp_rewind_seq(c, conn))
+ return -1;
+ }
+
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr)
tcp_data_from_sock(c, conn);
if (conn->seq_ack_from_tap == conn->seq_to_tap) {
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (6 preceding siblings ...)
2025-09-09 18:16 ` [PATCH v4 7/8] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
@ 2025-09-09 18:16 ` Stefano Brivio
2025-09-10 2:29 ` David Gibson
2025-09-10 9:10 ` [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Paul Holzinger
8 siblings, 1 reply; 18+ messages in thread
From: Stefano Brivio @ 2025-09-09 18:16 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger, David Gibson
For some reason, tcp_vu_data_from_sock() already takes care of this,
but the non-vhost-user version ignores this possibility and just sends
out a FIN segment whenever we infer we received one host-side,
regardless of the fact that we might have unacknowledged data still to
send.
Somewhat surprisingly, this didn't cause any issue to be reported yet,
until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
came around, leading to the following report from Paul, who hit this
running Podman tests:
439 0.033032 169.254.1.2 → 192.168.122.100 65540 TCP 56602 → 5789 [PSH, ACK] Seq=10336361 Ack=1 Win=65536 Len=65480
440 0.033055 169.254.1.2 → 192.168.122.100 30324 TCP [TCP Window Full] 56602 → 5789 [PSH, ACK] Seq=10401841 Ack=1 Win=65536 Len=30264
we're sending data to the container, up to the edge of the window
441 0.033059 192.168.122.100 → 169.254.1.2 60 TCP 5789 → 56602 [ACK] Seq=1 Ack=10401841 Win=83968 Len=0
and the container acknowledges it
442 0.033091 169.254.1.2 → 192.168.122.100 53716 TCP 56602 → 5789 [PSH, ACK] Seq=10432105 Ack=1 Win=65536 Len=53656
we send more data, all we possibly can, in window
443 0.033126 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
and the container shrinks the window due to the issue introduced
by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no
memory"). With a previous patch from this series, we rewind the
sequence, meaning that we assign conn->seq_to_tap from
conn->seq_ack_from_tap, so that we'll retransmit this segment, by
reading again from the socket, and increasing conn->seq_to_tap
once more.
However:
444 0.033144 169.254.1.2 → 192.168.122.100 60 TCP 56602 → 5789 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0
we eventually get a zero-length read from the socket and we miss the
fact that conn->seq_to_tap != conn->seq_ack_from_tap, so we send a
FIN flag with the most recent sequence. The kernel insists:
445 0.033147 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
with its buggy zero-window update, but:
446 0.033152 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=69120 Len=0
447 0.033202 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=142848 Len=0
we don't reset the TAP_FIN_SENT flag anymore, and don't resend
the FIN segment (nor data), as we already rewound the sequence
earlier.
To solve this, hold off the FIN segment until we get a zero-length
read from the socket *and* we know that there's no unacknowledged
pending data, also without vhost-user, in tcp_buf_data_from_sock().
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp_buf.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tcp_buf.c b/tcp_buf.c
index 4ebb013..49bddbe 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -363,7 +363,10 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
}
if (!len) {
- if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
+ if (already_sent) {
+ conn_flag(c, conn, STALLED);
+ } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
+ SOCK_FIN_RCVD) {
int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
if (ret) {
tcp_rst(c, conn);
--
@@ -363,7 +363,10 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
}
if (!len) {
- if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
+ if (already_sent) {
+ conn_flag(c, conn, STALLED);
+ } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
+ SOCK_FIN_RCVD) {
int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
if (ret) {
tcp_rst(c, conn);
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero
2025-09-09 18:16 ` [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
@ 2025-09-10 2:20 ` David Gibson
2025-09-10 6:37 ` Stefano Brivio
0 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2025-09-10 2:20 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 5240 bytes --]
On Tue, Sep 09, 2025 at 08:16:50PM +0200, Stefano Brivio wrote:
> A window shrunk to zero means by definition that anything else that
> might be in flight is now out of window. Restart from the currently
> acknowledged sequence.
>
> We need to do that both in tcp_tap_window_update(), where we already
> check for zero-window updates, as well as in tcp_data_from_tap(),
> because we might get one of those updates in a batch of packets that
> also contains a non-zero window update.
>
> Suggested-by: Jon Maloy <jmaloy@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Though a couple of documentation nits below.
> ---
> tcp.c | 34 +++++++++++++++++++++++++---------
> 1 file changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 86e08f1..12d42e0 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1268,19 +1268,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
>
> /**
> * tcp_tap_window_update() - Process an updated window from tap side
> + * @c: Execution context
> * @conn: Connection pointer
> * @wnd: Window value, host order, unscaled
> */
> -static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> +static void tcp_tap_window_update(const struct ctx *c,
> + struct tcp_tap_conn *conn, unsigned wnd)
> {
> wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
>
> /* Work-around for bug introduced in peer kernel code, commit
> - * e2142825c120 ("net: tcp: send zero-window ACK when no memory").
> - * We don't update if window shrank to zero.
> + * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't
> + * update the window if it shrank to zero, so that we'll eventually
> + * retry to send data, but rewind the sequence as that obviously implies
> + * that no data beyond the updated window will ever be acknowledged.
As noted earlier "will ever be acknowledged" might be a bit
misleading. Maybe "no data beyond the window will be acknowledged
until it is retransmitted".
> */
> - if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap))
> + if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
> + tcp_rewind_seq(c, conn);
> return;
> + }
>
> conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
>
> @@ -1709,7 +1715,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> tcp_timer_ctl(c, conn);
>
> if (p->count == 1) {
> - tcp_tap_window_update(conn, ntohs(th->window));
> + tcp_tap_window_update(c, conn,
> + ntohs(th->window));
> return 1;
> }
>
> @@ -1728,6 +1735,15 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> ack_seq == max_ack_seq &&
> ntohs(th->window) == max_ack_seq_wnd;
>
> + /* See tcp_tap_window_update() for details. On
> + * top of that, we also need to check here if a
> + * zero-window update is contained in a batch of
> + * packets that includes a non-zero window as
> + * well.
I'm not 100% convinced of this reasoning. But at worst this should
result in some unnecessary but mostly harmless retransmits, and it
seems to fix the problem empirically, so I'm not suggesting changing
it at this time.
> + */
> + if (!ntohs(th->window))
> + tcp_rewind_seq(c, conn);
> +
> max_ack_seq_wnd = ntohs(th->window);
> max_ack_seq = ack_seq;
> }
> @@ -1791,7 +1807,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> if (ack && !tcp_sock_consume(conn, max_ack_seq))
> tcp_update_seqack_from_tap(c, conn, max_ack_seq);
>
> - tcp_tap_window_update(conn, max_ack_seq_wnd);
> + tcp_tap_window_update(c, conn, max_ack_seq_wnd);
>
> if (retr) {
> flow_trace(conn,
> @@ -1880,7 +1896,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
> const struct tcphdr *th,
> const char *opts, size_t optlen)
> {
> - tcp_tap_window_update(conn, ntohs(th->window));
> + tcp_tap_window_update(c, conn, ntohs(th->window));
> tcp_get_tap_ws(conn, opts, optlen);
>
> /* First value is not scaled */
> @@ -2085,7 +2101,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> if (!th->ack)
> goto reset;
>
> - tcp_tap_window_update(conn, ntohs(th->window));
> + tcp_tap_window_update(c, conn, ntohs(th->window));
>
> tcp_data_from_sock(c, conn);
>
> @@ -2097,7 +2113,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> if (conn->events & TAP_FIN_RCVD) {
> tcp_sock_consume(conn, ntohl(th->ack_seq));
> tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> - tcp_tap_window_update(conn, ntohs(th->window));
> + tcp_tap_window_update(c, conn, ntohs(th->window));
> tcp_data_from_sock(c, conn);
>
> if (conn->events & SOCK_FIN_RCVD &&
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 6/8] tcp: Cast operands of sequence comparison macros to uint32_t before using them
2025-09-09 18:16 ` [PATCH v4 6/8] tcp: Cast operands of sequence comparison macros to uint32_t before using them Stefano Brivio
@ 2025-09-10 2:21 ` David Gibson
0 siblings, 0 replies; 18+ messages in thread
From: David Gibson @ 2025-09-10 2:21 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 1732 bytes --]
On Tue, Sep 09, 2025 at 08:16:53PM +0200, Stefano Brivio wrote:
> Otherwise, passing signed types causes automatic promotion of the
> result of the subtractions as well, which is not what we want, as
> these macros rely on unsigned 32-bit arithmetic.
>
> The next patch introduces a ssize_t operand for SEQ_LE, illustrating
> the issue.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
C macros and promotion rules strike again :/. Personally I'd probably
switch to inline functions, but this is fine too.
> ---
> tcp_internal.h | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/tcp_internal.h b/tcp_internal.h
> index d0009f8..ce6fee2 100644
> --- a/tcp_internal.h
> +++ b/tcp_internal.h
> @@ -21,10 +21,14 @@
> sizeof(struct ipv6hdr), \
> sizeof(uint32_t))
>
> -#define SEQ_LE(a, b) ((b) - (a) < MAX_WINDOW)
> -#define SEQ_LT(a, b) ((b) - (a) - 1 < MAX_WINDOW)
> -#define SEQ_GE(a, b) ((a) - (b) < MAX_WINDOW)
> -#define SEQ_GT(a, b) ((a) - (b) - 1 < MAX_WINDOW)
> +#define SEQ_LE(a, b) \
> + ((uint32_t)(b) - (uint32_t)(a) < MAX_WINDOW)
> +#define SEQ_LT(a, b) \
> + ((uint32_t)(b) - (uint32_t)(a) - 1 < MAX_WINDOW)
> +#define SEQ_GE(a, b) \
> + ((uint32_t)(a) - (uint32_t)(b) < MAX_WINDOW)
> +#define SEQ_GT(a, b) \
> + ((uint32_t)(a) - (uint32_t)(b) - 1 < MAX_WINDOW)
>
> #define FIN (1 << 0)
> #define SYN (1 << 1)
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 7/8] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent
2025-09-09 18:16 ` [PATCH v4 7/8] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
@ 2025-09-10 2:27 ` David Gibson
2025-09-10 9:57 ` Stefano Brivio
0 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2025-09-10 2:27 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 3617 bytes --]
On Tue, Sep 09, 2025 at 08:16:54PM +0200, Stefano Brivio wrote:
> We currently have a number of discrepancies in the tcp_tap_handler()
> path between the half-closed connection path and the regular one, and
> they are mostly a result of code duplication, which comes in turn from
> the fact that tcp_data_from_tap() deals with data transfers as well as
> general connection bookkeeping, so we can't use it for half-closed
> connections.
>
> This suggests that we should probably rework it into two or more
> functions, in the long term, but for the moment being I'm just fixing
> one obvious issue, which is the lack of fast retransmissions in the
> TAP_FIN_RCVD path, and a potential one, which is the fact we don't
> handle socket flush failures.
>
> Add fast re-transmit for half-closed connections, and handle the case
> of socket flush (tcp_sock_consume()) flush failure in the same way as
> tcp_data_from_tap() handles it.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 42 +++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 39 insertions(+), 3 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 9c70a25..5163dbf 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1652,6 +1652,23 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> return tcp_buf_data_from_sock(c, conn);
> }
>
> +/**
> + * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet
> + * @th: Pointer to TCP header
> + * @l4len: TCP packet length, including TCP header
> + *
> + * Return: data length of TCP packet, -1 on invalid value of Data Offset field
> + */
> +static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len)
> +{
> + size_t off = th->doff * 4UL;
> +
> + if (off < sizeof(*th) || off > l4len)
> + return -1;
> +
> + return l4len - off;
> +}
> +
> /**
> * tcp_data_from_tap() - tap/guest data for established connection
> * @c: Execution context
> @@ -2113,9 +2130,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
>
> /* Established connections not accepting data from tap */
> if (conn->events & TAP_FIN_RCVD) {
> - tcp_sock_consume(conn, ntohl(th->ack_seq));
> - tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> - if (tcp_tap_window_update(c, conn, ntohs(th->window)))
> + bool retr;
> +
> + retr = th->ack && !tcp_packet_data_len(th, l4len) && !th->fin &&
Not really in scope here, but I wonder if we should log an error
and/or RST if we get a non-zero data length in this situation.
> + ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
> + ntohs(th->window) == conn->wnd_from_tap;
> +
> + /* On socket flush failure, pretend there was no ACK, try again
> + * later
> + */
> + if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq)))
> + tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> +
> + if (retr) {
> + flow_trace(conn,
> + "fast re-transmit, ACK: %u, previous sequence: %u",
> + ntohl(th->ack_seq), conn->seq_to_tap);
> +
> + if (tcp_rewind_seq(c, conn))
> + return -1;
> + }
> +
> + if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr)
> tcp_data_from_sock(c, conn);
>
> if (conn->seq_ack_from_tap == conn->seq_to_tap) {
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data
2025-09-09 18:16 ` [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data Stefano Brivio
@ 2025-09-10 2:29 ` David Gibson
2025-09-10 6:37 ` Stefano Brivio
0 siblings, 1 reply; 18+ messages in thread
From: David Gibson @ 2025-09-10 2:29 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 4209 bytes --]
On Tue, Sep 09, 2025 at 08:16:55PM +0200, Stefano Brivio wrote:
> For some reason, tcp_vu_data_from_sock() already takes care of this,
> but the non-vhost-user version ignores this possibility and just sends
> out a FIN segment whenever we infer we received one host-side,
> regardless of the fact that we might have unacknowledged data still to
> send.
>
> Somewhat surprisingly, this didn't cause any issue to be reported yet,
> until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> came around, leading to the following report from Paul, who hit this
> running Podman tests:
>
> 439 0.033032 169.254.1.2 → 192.168.122.100 65540 TCP 56602 → 5789 [PSH, ACK] Seq=10336361 Ack=1 Win=65536 Len=65480
> 440 0.033055 169.254.1.2 → 192.168.122.100 30324 TCP [TCP Window Full] 56602 → 5789 [PSH, ACK] Seq=10401841 Ack=1 Win=65536 Len=30264
>
> we're sending data to the container, up to the edge of the window
>
> 441 0.033059 192.168.122.100 → 169.254.1.2 60 TCP 5789 → 56602 [ACK] Seq=1 Ack=10401841 Win=83968 Len=0
>
> and the container acknowledges it
>
> 442 0.033091 169.254.1.2 → 192.168.122.100 53716 TCP 56602 → 5789 [PSH, ACK] Seq=10432105 Ack=1 Win=65536 Len=53656
>
> we send more data, all we possibly can, in window
>
> 443 0.033126 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
>
> and the container shrinks the window due to the issue introduced
> by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no
> memory"). With a previous patch from this series, we rewind the
> sequence, meaning that we assign conn->seq_to_tap from
> conn->seq_ack_from_tap, so that we'll retransmit this segment, by
> reading again from the socket, and increasing conn->seq_to_tap
> once more.
>
> However:
>
> 444 0.033144 169.254.1.2 → 192.168.122.100 60 TCP 56602 → 5789 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0
>
> we eventually get a zero-length read from the socket and we miss the
> fact that conn->seq_to_tap != conn->seq_ack_from_tap, so we send a
> FIN flag with the most recent sequence. The kernel insists:
>
> 445 0.033147 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
>
> with its buggy zero-window update, but:
>
> 446 0.033152 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=69120 Len=0
> 447 0.033202 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=142848 Len=0
>
> we don't reset the TAP_FIN_SENT flag anymore, and don't resend
> the FIN segment (nor data), as we already rewound the sequence
> earlier.
>
> To solve this, hold off the FIN segment until we get a zero-length
> read from the socket *and* we know that there's no unacknowledged
> pending data, also without vhost-user, in tcp_buf_data_from_sock().
>
> Reported-by: Paul Holzinger <pholzing@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
At least, I think this makes sense. As always, I find the semantics
of the STALLED flag difficult to wrap my head around.
> ---
> tcp_buf.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/tcp_buf.c b/tcp_buf.c
> index 4ebb013..49bddbe 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -363,7 +363,10 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> }
>
> if (!len) {
> - if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> + if (already_sent) {
> + conn_flag(c, conn, STALLED);
> + } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> + SOCK_FIN_RCVD) {
> int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> if (ret) {
> tcp_rst(c, conn);
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero
2025-09-10 2:20 ` David Gibson
@ 2025-09-10 6:37 ` Stefano Brivio
2025-09-10 7:18 ` David Gibson
0 siblings, 1 reply; 18+ messages in thread
From: Stefano Brivio @ 2025-09-10 6:37 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Jon Maloy, Paul Holzinger
On Wed, 10 Sep 2025 12:20:19 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Sep 09, 2025 at 08:16:50PM +0200, Stefano Brivio wrote:
> > A window shrunk to zero means by definition that anything else that
> > might be in flight is now out of window. Restart from the currently
> > acknowledged sequence.
> >
> > We need to do that both in tcp_tap_window_update(), where we already
> > check for zero-window updates, as well as in tcp_data_from_tap(),
> > because we might get one of those updates in a batch of packets that
> > also contains a non-zero window update.
> >
> > Suggested-by: Jon Maloy <jmaloy@redhat.com>
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> Though a couple of documentation nits below.
>
> > ---
> > tcp.c | 34 +++++++++++++++++++++++++---------
> > 1 file changed, 25 insertions(+), 9 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 86e08f1..12d42e0 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1268,19 +1268,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
> >
> > /**
> > * tcp_tap_window_update() - Process an updated window from tap side
> > + * @c: Execution context
> > * @conn: Connection pointer
> > * @wnd: Window value, host order, unscaled
> > */
> > -static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> > +static void tcp_tap_window_update(const struct ctx *c,
> > + struct tcp_tap_conn *conn, unsigned wnd)
> > {
> > wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> >
> > /* Work-around for bug introduced in peer kernel code, commit
> > - * e2142825c120 ("net: tcp: send zero-window ACK when no memory").
> > - * We don't update if window shrank to zero.
> > + * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't
> > + * update the window if it shrank to zero, so that we'll eventually
> > + * retry to send data, but rewind the sequence as that obviously implies
> > + * that no data beyond the updated window will ever be acknowledged.
>
> As noted earlier "will ever be acknowledged" might be a bit
> misleading. Maybe "no data beyond the window will be acknowledged
> until it is retransmitted".
Sorry, I actually fixed the patch after your same comment on v3 and
then for some reason I threw away the changes together with some of
your Reviewed-by: tags. Fixed as we discussed on v3 now.
> > */
> > - if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap))
> > + if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
> > + tcp_rewind_seq(c, conn);
> > return;
> > + }
> >
> > conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
> >
> > @@ -1709,7 +1715,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > tcp_timer_ctl(c, conn);
> >
> > if (p->count == 1) {
> > - tcp_tap_window_update(conn, ntohs(th->window));
> > + tcp_tap_window_update(c, conn,
> > + ntohs(th->window));
> > return 1;
> > }
> >
> > @@ -1728,6 +1735,15 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > ack_seq == max_ack_seq &&
> > ntohs(th->window) == max_ack_seq_wnd;
> >
> > + /* See tcp_tap_window_update() for details. On
> > + * top of that, we also need to check here if a
> > + * zero-window update is contained in a batch of
> > + * packets that includes a non-zero window as
> > + * well.
>
> I'm not 100% convinced of this reasoning. But at worst this should
> result in some unnecessary but mostly harmless retransmits, and it
> seems to fix the problem empirically, so I'm not suggesting changing
> it at this time.
Right, strictly speaking, it *might* be that the peer didn't actually
throw data away, which should be indicated by the fact that the same
data is acknowledged in another segment of this same batch.
But handling that without making this part unreasonably complicated
implies a refactor of this whole thing: we should extract a function
dealing with sequences and perhaps another one with window updates from
here.
I plan to file a ticket or two with stuff that emerged from this series.
--
Stefano
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data
2025-09-10 2:29 ` David Gibson
@ 2025-09-10 6:37 ` Stefano Brivio
0 siblings, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2025-09-10 6:37 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Jon Maloy, Paul Holzinger
On Wed, 10 Sep 2025 12:29:48 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Sep 09, 2025 at 08:16:55PM +0200, Stefano Brivio wrote:
> > For some reason, tcp_vu_data_from_sock() already takes care of this,
> > but the non-vhost-user version ignores this possibility and just sends
> > out a FIN segment whenever we infer we received one host-side,
> > regardless of the fact that we might have unacknowledged data still to
> > send.
> >
> > Somewhat surprisingly, this didn't cause any issue to be reported yet,
> > until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> > came around, leading to the following report from Paul, who hit this
> > running Podman tests:
> >
> > 439 0.033032 169.254.1.2 → 192.168.122.100 65540 TCP 56602 → 5789 [PSH, ACK] Seq=10336361 Ack=1 Win=65536 Len=65480
> > 440 0.033055 169.254.1.2 → 192.168.122.100 30324 TCP [TCP Window Full] 56602 → 5789 [PSH, ACK] Seq=10401841 Ack=1 Win=65536 Len=30264
> >
> > we're sending data to the container, up to the edge of the window
> >
> > 441 0.033059 192.168.122.100 → 169.254.1.2 60 TCP 5789 → 56602 [ACK] Seq=1 Ack=10401841 Win=83968 Len=0
> >
> > and the container acknowledges it
> >
> > 442 0.033091 169.254.1.2 → 192.168.122.100 53716 TCP 56602 → 5789 [PSH, ACK] Seq=10432105 Ack=1 Win=65536 Len=53656
> >
> > we send more data, all we possibly can, in window
> >
> > 443 0.033126 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
> >
> > and the container shrinks the window due to the issue introduced
> > by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no
> > memory"). With a previous patch from this series, we rewind the
> > sequence, meaning that we assign conn->seq_to_tap from
> > conn->seq_ack_from_tap, so that we'll retransmit this segment, by
> > reading again from the socket, and increasing conn->seq_to_tap
> > once more.
> >
> > However:
> >
> > 444 0.033144 169.254.1.2 → 192.168.122.100 60 TCP 56602 → 5789 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0
> >
> > we eventually get a zero-length read from the socket and we miss the
> > fact that conn->seq_to_tap != conn->seq_ack_from_tap, so we send a
> > FIN flag with the most recent sequence. The kernel insists:
> >
> > 445 0.033147 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
> >
> > with its buggy zero-window update, but:
> >
> > 446 0.033152 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=69120 Len=0
> > 447 0.033202 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=142848 Len=0
> >
> > we don't reset the TAP_FIN_SENT flag anymore, and don't resend
> > the FIN segment (nor data), as we already rewound the sequence
> > earlier.
> >
> > To solve this, hold off the FIN segment until we get a zero-length
> > read from the socket *and* we know that there's no unacknowledged
> > pending data, also without vhost-user, in tcp_buf_data_from_sock().
> >
> > Reported-by: Paul Holzinger <pholzing@redhat.com>
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> At least, I think this makes sense. As always, I find the semantics
> of the STALLED flag difficult to wrap my head around.
After this fix I start thinking that we should probably split STALLED
into several flags explicitly indicating what we are waiting for,
because it's rather complicated to make sure that some code path will
eventually take care of doing something on a later trigger/condition
(in this case, sending the FIN flag when we can).
I don't have a concrete list of possible flags in mind but something
like WAITING_ON_(x), or consistent sets of BLOCKED_BY_(x) (we can't
proceed with anything otherwise) and EXPECTING_(x) (x is expected but
we can proceed with something else).
Maybe we'll need to go the full (y)_BLOCKED_BY_(x) route, though
(FIN_BLOCKED_BY_TAP_ACK?). I hope not.
--
Stefano
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero
2025-09-10 6:37 ` Stefano Brivio
@ 2025-09-10 7:18 ` David Gibson
0 siblings, 0 replies; 18+ messages in thread
From: David Gibson @ 2025-09-10 7:18 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 4872 bytes --]
On Wed, Sep 10, 2025 at 08:37:48AM +0200, Stefano Brivio wrote:
> On Wed, 10 Sep 2025 12:20:19 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Sep 09, 2025 at 08:16:50PM +0200, Stefano Brivio wrote:
> > > A window shrunk to zero means by definition that anything else that
> > > might be in flight is now out of window. Restart from the currently
> > > acknowledged sequence.
> > >
> > > We need to do that both in tcp_tap_window_update(), where we already
> > > check for zero-window updates, as well as in tcp_data_from_tap(),
> > > because we might get one of those updates in a batch of packets that
> > > also contains a non-zero window update.
> > >
> > > Suggested-by: Jon Maloy <jmaloy@redhat.com>
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> >
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >
> > Though a couple of documentation nits below.
> >
> > > ---
> > > tcp.c | 34 +++++++++++++++++++++++++---------
> > > 1 file changed, 25 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 86e08f1..12d42e0 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -1268,19 +1268,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
> > >
> > > /**
> > > * tcp_tap_window_update() - Process an updated window from tap side
> > > + * @c: Execution context
> > > * @conn: Connection pointer
> > > * @wnd: Window value, host order, unscaled
> > > */
> > > -static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> > > +static void tcp_tap_window_update(const struct ctx *c,
> > > + struct tcp_tap_conn *conn, unsigned wnd)
> > > {
> > > wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> > >
> > > /* Work-around for bug introduced in peer kernel code, commit
> > > - * e2142825c120 ("net: tcp: send zero-window ACK when no memory").
> > > - * We don't update if window shrank to zero.
> > > + * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't
> > > + * update the window if it shrank to zero, so that we'll eventually
> > > + * retry to send data, but rewind the sequence as that obviously implies
> > > + * that no data beyond the updated window will ever be acknowledged.
> >
> > As noted earlier "will ever be acknowledged" might be a bit
> > misleading. Maybe "no data beyond the window will be acknowledged
> > until it is retransmitted".
>
> Sorry, I actually fixed the patch after your same comment on v3 and
> then for some reason I threw away the changes together with some of
> your Reviewed-by: tags. Fixed as we discussed on v3 now.
Ah, right. It's happened to all of us.
>
> > > */
> > > - if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap))
> > > + if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
> > > + tcp_rewind_seq(c, conn);
> > > return;
> > > + }
> > >
> > > conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
> > >
> > > @@ -1709,7 +1715,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > tcp_timer_ctl(c, conn);
> > >
> > > if (p->count == 1) {
> > > - tcp_tap_window_update(conn, ntohs(th->window));
> > > + tcp_tap_window_update(c, conn,
> > > + ntohs(th->window));
> > > return 1;
> > > }
> > >
> > > @@ -1728,6 +1735,15 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > ack_seq == max_ack_seq &&
> > > ntohs(th->window) == max_ack_seq_wnd;
> > >
> > > + /* See tcp_tap_window_update() for details. On
> > > + * top of that, we also need to check here if a
> > > + * zero-window update is contained in a batch of
> > > + * packets that includes a non-zero window as
> > > + * well.
> >
> > I'm not 100% convinced of this reasoning. But at worst this should
> > result in some unnecessary but mostly harmless retransmits, and it
> > seems to fix the problem empirically, so I'm not suggesting changing
> > it at this time.
>
> Right, strictly speaking, it *might* be that the peer didn't actually
> throw data away, which should be indicated by the fact that the same
> data is acknowledged in another segment of this same batch.
>
> But handling that without making this part unreasonably complicated
> implies a refactor of this whole thing: we should extract a function
> dealing with sequences and perhaps another one with window updates from
> here.
>
> I plan to file a ticket or two with stuff that emerged from this series.
Ok, sounds good.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (7 preceding siblings ...)
2025-09-09 18:16 ` [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data Stefano Brivio
@ 2025-09-10 9:10 ` Paul Holzinger
8 siblings, 0 replies; 18+ messages in thread
From: Paul Holzinger @ 2025-09-10 9:10 UTC (permalink / raw)
To: Stefano Brivio, passt-dev; +Cc: Jon Maloy, David Gibson
On 09/09/2025 20:16, Stefano Brivio wrote:
> Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger
> sk_rcvbuf checks"), window limits are enforced more aggressively with
> a bigger amount of zero-window updates compared to what happened with
> e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone,
> and occasional duplicate ACKs can now be seen also for local transfers
> with default (208 KiB) socket buffer sizes.
>
> Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the
> pasta integration occasionally fail on the "TCP/IPv4 large transfer,
> tap" case.
>
> While playing with a reproducer that seems to be matching those
> failures:
>
> while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
>
> and a kernel including that commit, I hit a few different failures,
> that should be fixed by this series.
>
> Paul tested v1 of this series and found an additional failure
> (transfer timeout), which I could reproduce with a slightly different
> command:
>
> while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
>
> where write.sh is simply:
>
> #!/bin/sh
>
> cat > /tmp/large.rcv
>
> so that the connection is not half-closed starting from the beginning,
> because socat can't make assumptions about the unidirectional nature
> of the traffic. This should now be fixed as well by the new version of
> patch 3/8.
>
> v4:
> - add patch 8/8
>
> v3:
> - add patch 6/7
> - in 7/7, check dlen <= 1 for keep-alive segments, instead of len <= 1
>
> v2: in 3/6, rewind sequence also if the zero-window update comes in
> the middle of a batch with non-zero window updates
>
> Stefano Brivio (8):
> tcp: FIN flags have to be retransmitted as well
> tcp: Factor sequence rewind for retransmissions into a new function
> tcp: Rewind sequence when guest shrinks window to zero
> tcp: Fix closing logic for half-closed connections
> tcp: Don't try to transmit right after the peer shrank the window to
> zero
> tcp: Cast operands of sequence comparison macros to uint32_t before
> using them
> tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path
> consistent
> tcp: Don't send FIN segment to guest yet if we have pending
> unacknowledged data
>
> tcp.c | 142 ++++++++++++++++++++++++++++++++++++++-----------
> tcp_buf.c | 5 +-
> tcp_internal.h | 12 +++--
> 3 files changed, 122 insertions(+), 37 deletions(-)
>
Reproducer runs for over 14 hours now without failure so looks like we
found all the problems now.
Tested-by: Paul Holzinger <pholzing@redhat.com>
--
Paul Holzinger
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 7/8] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent
2025-09-10 2:27 ` David Gibson
@ 2025-09-10 9:57 ` Stefano Brivio
0 siblings, 0 replies; 18+ messages in thread
From: Stefano Brivio @ 2025-09-10 9:57 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Jon Maloy, Paul Holzinger
On Wed, 10 Sep 2025 12:27:12 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Sep 09, 2025 at 08:16:54PM +0200, Stefano Brivio wrote:
> > We currently have a number of discrepancies in the tcp_tap_handler()
> > path between the half-closed connection path and the regular one, and
> > they are mostly a result of code duplication, which comes in turn from
> > the fact that tcp_data_from_tap() deals with data transfers as well as
> > general connection bookkeeping, so we can't use it for half-closed
> > connections.
> >
> > This suggests that we should probably rework it into two or more
> > functions, in the long term, but for the moment being I'm just fixing
> > one obvious issue, which is the lack of fast retransmissions in the
> > TAP_FIN_RCVD path, and a potential one, which is the fact we don't
> > handle socket flush failures.
> >
> > Add fast re-transmit for half-closed connections, and handle the case
> > of socket flush (tcp_sock_consume()) flush failure in the same way as
> > tcp_data_from_tap() handles it.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> > ---
> > tcp.c | 42 +++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 39 insertions(+), 3 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 9c70a25..5163dbf 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1652,6 +1652,23 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > return tcp_buf_data_from_sock(c, conn);
> > }
> >
> > +/**
> > + * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet
> > + * @th: Pointer to TCP header
> > + * @l4len: TCP packet length, including TCP header
> > + *
> > + * Return: data length of TCP packet, -1 on invalid value of Data Offset field
> > + */
> > +static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len)
> > +{
> > + size_t off = th->doff * 4UL;
> > +
> > + if (off < sizeof(*th) || off > l4len)
> > + return -1;
> > +
> > + return l4len - off;
> > +}
> > +
> > /**
> > * tcp_data_from_tap() - tap/guest data for established connection
> > * @c: Execution context
> > @@ -2113,9 +2130,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> >
> > /* Established connections not accepting data from tap */
> > if (conn->events & TAP_FIN_RCVD) {
> > - tcp_sock_consume(conn, ntohl(th->ack_seq));
> > - tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> > - if (tcp_tap_window_update(c, conn, ntohs(th->window)))
> > + bool retr;
> > +
> > + retr = th->ack && !tcp_packet_data_len(th, l4len) && !th->fin &&
>
> Not really in scope here, but I wonder if we should log an error
> and/or RST if we get a non-zero data length in this situation.
According to RFC 9293 we should ignore data (note: not data segments)
in this case, see 3.10.7.4 "Other states":
[...]
Seventh, process the segment text:
[...]
CLOSE-WAIT STATE
This should not occur since a FIN has been received from the remote side. Ignore the segment text.
https://www.rfc-editor.org/rfc/rfc9293.html#section-3.10.7.4-2.7.2.7.1
We could add a debug() message perhaps (in a further patch), but I don't
think we are allowed to reset the connection.
> > + ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
> > + ntohs(th->window) == conn->wnd_from_tap;
> > +
> > + /* On socket flush failure, pretend there was no ACK, try again
> > + * later
> > + */
> > + if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq)))
> > + tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> > +
> > + if (retr) {
> > + flow_trace(conn,
> > + "fast re-transmit, ACK: %u, previous sequence: %u",
> > + ntohl(th->ack_seq), conn->seq_to_tap);
> > +
> > + if (tcp_rewind_seq(c, conn))
> > + return -1;
> > + }
> > +
> > + if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr)
> > tcp_data_from_sock(c, conn);
> >
> > if (conn->seq_ack_from_tap == conn->seq_to_tap) {
> > --
> > 2.43.0
--
Stefano
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-09-10 9:57 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 1/8] tcp: FIN flags have to be retransmitted as well Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 2/8] tcp: Factor sequence rewind for retransmissions into a new function Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
2025-09-10 2:20 ` David Gibson
2025-09-10 6:37 ` Stefano Brivio
2025-09-10 7:18 ` David Gibson
2025-09-09 18:16 ` [PATCH v4 4/8] tcp: Fix closing logic for half-closed connections Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 5/8] tcp: Don't try to transmit right after the peer shrank the window to zero Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 6/8] tcp: Cast operands of sequence comparison macros to uint32_t before using them Stefano Brivio
2025-09-10 2:21 ` David Gibson
2025-09-09 18:16 ` [PATCH v4 7/8] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
2025-09-10 2:27 ` David Gibson
2025-09-10 9:57 ` Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data Stefano Brivio
2025-09-10 2:29 ` David Gibson
2025-09-10 6:37 ` Stefano Brivio
2025-09-10 9:10 ` [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Paul Holzinger
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).