* [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels
@ 2025-08-15 16:10 Stefano Brivio
2025-08-15 16:10 ` [PATCH 1/6] tcp: FIN flags have to be retransmitted as well Stefano Brivio
` (6 more replies)
0 siblings, 7 replies; 17+ messages in thread
From: Stefano Brivio @ 2025-08-15 16:10 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger
Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger
sk_rcvbuf checks"), window limits are enforced more aggressively with
a bigger amount of zero-window updates compared to what happened with
e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone,
and occasional duplicate ACKs can now be seen also for local transfers
with default (208 KiB) socket buffer sizes.
Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the
pasta integration occasionally fail on the "TCP/IPv4 large transfer,
tap" case.
While playing with a reproducer that seems to be matching those
failures:
while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
and a kernel including that commit, I hit a few different failures,
that should be fixed by this series.
Stefano Brivio (6):
tcp: FIN flags have to be retransmitted as well
tcp: Factor sequence rewind for retransmissions into a new function
tcp: Rewind sequence when guest shrinks window to zero
tcp: Fix closing logic for half-closed connections
tcp: Don't try to transmit right after the peer shrank the window to
zero
tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path
consistent
tcp.c | 170 ++++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 118 insertions(+), 52 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH 1/6] tcp: FIN flags have to be retransmitted as well
2025-08-15 16:10 [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
@ 2025-08-15 16:10 ` Stefano Brivio
2025-08-18 2:43 ` David Gibson
2025-08-15 16:10 ` [PATCH 2/6] tcp: Factor sequence rewind for retransmissions into a new function Stefano Brivio
` (5 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Stefano Brivio @ 2025-08-15 16:10 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger
If we're retransmitting any data, and we sent a FIN segment to our
peer, regardless of whether it was received, we obviously have to
retransmit it as well, given that it can only come with the last data
segment, or after it.
Unconditionally clear the internal TAP_FIN_SENT flag whenever we
re-transmit, so that we know we have to send it again, in case.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tcp.c b/tcp.c
index 957b498..7c1f237 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1759,6 +1759,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
"fast re-transmit, ACK: %u, previous sequence: %u",
max_ack_seq, conn->seq_to_tap);
conn->seq_to_tap = max_ack_seq;
+ conn->events &= ~TAP_FIN_SENT;
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
@@ -2286,6 +2287,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
flow_dbg(conn, "ACK timeout, retry");
conn->retrans++;
conn->seq_to_tap = conn->seq_ack_from_tap;
+ conn->events &= ~TAP_FIN_SENT;
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
if (tcp_set_peek_offset(conn, 0)) {
--
@@ -1759,6 +1759,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
"fast re-transmit, ACK: %u, previous sequence: %u",
max_ack_seq, conn->seq_to_tap);
conn->seq_to_tap = max_ack_seq;
+ conn->events &= ~TAP_FIN_SENT;
if (tcp_set_peek_offset(conn, 0)) {
tcp_rst(c, conn);
return -1;
@@ -2286,6 +2287,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
flow_dbg(conn, "ACK timeout, retry");
conn->retrans++;
conn->seq_to_tap = conn->seq_ack_from_tap;
+ conn->events &= ~TAP_FIN_SENT;
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
if (tcp_set_peek_offset(conn, 0)) {
--
2.43.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 2/6] tcp: Factor sequence rewind for retransmissions into a new function
2025-08-15 16:10 [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
2025-08-15 16:10 ` [PATCH 1/6] tcp: FIN flags have to be retransmitted as well Stefano Brivio
@ 2025-08-15 16:10 ` Stefano Brivio
2025-08-18 2:46 ` David Gibson
2025-08-15 16:10 ` [PATCH 3/6] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
` (4 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Stefano Brivio @ 2025-08-15 16:10 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger
...as I'm going to need a third occurrence of this in the next change.
This introduces a small functional change in tcp_data_from_tap(): the
sequence was previously rewound to the highest ACK number we found in
the current packet batch, and not to the current value of
seq_ack_from_tap.
The two might differ in case tcp_sock_consume() failed, because in
that case we're ignoring that ACK altogether. But if we're ignoring
it, it looks more correct to me to start retransmitting from an
earlier sequence anyway.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 47 ++++++++++++++++++++++++++++++++---------------
1 file changed, 32 insertions(+), 15 deletions(-)
diff --git a/tcp.c b/tcp.c
index 7c1f237..1402ca2 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1097,6 +1097,26 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
}
}
+/**
+ * tcp_rewind_seq() - Rewind sequence to tap and socket offset to current ACK
+ * @c: Execution context
+ * @conn: Connection pointer
+ *
+ * Return: 0 on success, -1 on failure, with connection reset
+ */
+static int tcp_rewind_seq(const struct ctx *c, struct tcp_tap_conn *conn)
+{
+ conn->seq_to_tap = conn->seq_ack_from_tap;
+ conn->events &= ~TAP_FIN_SENT;
+
+ if (tcp_set_peek_offset(conn, 0)) {
+ tcp_rst(c, conn);
+ return -1;
+ }
+
+ return 0;
+}
+
/**
* tcp_prepare_flags() - Prepare header for flags-only segment (no payload)
* @c: Execution context
@@ -1757,13 +1777,11 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (retr) {
flow_trace(conn,
"fast re-transmit, ACK: %u, previous sequence: %u",
- max_ack_seq, conn->seq_to_tap);
- conn->seq_to_tap = max_ack_seq;
- conn->events &= ~TAP_FIN_SENT;
- if (tcp_set_peek_offset(conn, 0)) {
- tcp_rst(c, conn);
+ conn->seq_ack_from_tap, conn->seq_to_tap);
+
+ if (tcp_rewind_seq(c, conn))
return -1;
- }
+
tcp_data_from_sock(c, conn);
}
@@ -2285,17 +2303,16 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
tcp_rst(c, conn);
} else {
flow_dbg(conn, "ACK timeout, retry");
- conn->retrans++;
- conn->seq_to_tap = conn->seq_ack_from_tap;
- conn->events &= ~TAP_FIN_SENT;
+
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
- if (tcp_set_peek_offset(conn, 0)) {
- tcp_rst(c, conn);
- } else {
- tcp_data_from_sock(c, conn);
- tcp_timer_ctl(c, conn);
- }
+
+ conn->retrans++;
+ if (tcp_rewind_seq(c, conn))
+ return;
+
+ tcp_data_from_sock(c, conn);
+ tcp_timer_ctl(c, conn);
}
} else {
struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
--
@@ -1097,6 +1097,26 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
}
}
+/**
+ * tcp_rewind_seq() - Rewind sequence to tap and socket offset to current ACK
+ * @c: Execution context
+ * @conn: Connection pointer
+ *
+ * Return: 0 on success, -1 on failure, with connection reset
+ */
+static int tcp_rewind_seq(const struct ctx *c, struct tcp_tap_conn *conn)
+{
+ conn->seq_to_tap = conn->seq_ack_from_tap;
+ conn->events &= ~TAP_FIN_SENT;
+
+ if (tcp_set_peek_offset(conn, 0)) {
+ tcp_rst(c, conn);
+ return -1;
+ }
+
+ return 0;
+}
+
/**
* tcp_prepare_flags() - Prepare header for flags-only segment (no payload)
* @c: Execution context
@@ -1757,13 +1777,11 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (retr) {
flow_trace(conn,
"fast re-transmit, ACK: %u, previous sequence: %u",
- max_ack_seq, conn->seq_to_tap);
- conn->seq_to_tap = max_ack_seq;
- conn->events &= ~TAP_FIN_SENT;
- if (tcp_set_peek_offset(conn, 0)) {
- tcp_rst(c, conn);
+ conn->seq_ack_from_tap, conn->seq_to_tap);
+
+ if (tcp_rewind_seq(c, conn))
return -1;
- }
+
tcp_data_from_sock(c, conn);
}
@@ -2285,17 +2303,16 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
tcp_rst(c, conn);
} else {
flow_dbg(conn, "ACK timeout, retry");
- conn->retrans++;
- conn->seq_to_tap = conn->seq_ack_from_tap;
- conn->events &= ~TAP_FIN_SENT;
+
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
- if (tcp_set_peek_offset(conn, 0)) {
- tcp_rst(c, conn);
- } else {
- tcp_data_from_sock(c, conn);
- tcp_timer_ctl(c, conn);
- }
+
+ conn->retrans++;
+ if (tcp_rewind_seq(c, conn))
+ return;
+
+ tcp_data_from_sock(c, conn);
+ tcp_timer_ctl(c, conn);
}
} else {
struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
--
2.43.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 3/6] tcp: Rewind sequence when guest shrinks window to zero
2025-08-15 16:10 [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
2025-08-15 16:10 ` [PATCH 1/6] tcp: FIN flags have to be retransmitted as well Stefano Brivio
2025-08-15 16:10 ` [PATCH 2/6] tcp: Factor sequence rewind for retransmissions into a new function Stefano Brivio
@ 2025-08-15 16:10 ` Stefano Brivio
2025-08-18 2:49 ` David Gibson
2025-08-15 16:10 ` [PATCH 4/6] tcp: Fix closing logic for half-closed connections Stefano Brivio
` (3 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Stefano Brivio @ 2025-08-15 16:10 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger
A window shrunk to zero means by definition that anything else that
might be in flight is now out of window. Restart from the currently
acknowledged sequence.
Suggested-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/tcp.c b/tcp.c
index 1402ca2..dda0a2e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1257,19 +1257,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
/**
* tcp_tap_window_update() - Process an updated window from tap side
+ * @c: Execution context
* @conn: Connection pointer
* @wnd: Window value, host order, unscaled
*/
-static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
+static void tcp_tap_window_update(const struct ctx *c,
+ struct tcp_tap_conn *conn, unsigned wnd)
{
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
/* Work-around for bug introduced in peer kernel code, commit
- * e2142825c120 ("net: tcp: send zero-window ACK when no memory").
- * We don't update if window shrank to zero.
+ * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't
+ * update the window if it shrank to zero, so that we'll eventually
+ * retry to send data, but rewind the sequence as that obviously implies
+ * that no data beyond the updated window will ever be acknowledged.
*/
- if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap))
+ if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
+ tcp_rewind_seq(c, conn);
return;
+ }
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
@@ -1694,7 +1700,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
tcp_timer_ctl(c, conn);
if (p->count == 1) {
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn,
+ ntohs(th->window));
return 1;
}
@@ -1772,7 +1779,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (ack && !tcp_sock_consume(conn, max_ack_seq))
tcp_update_seqack_from_tap(c, conn, max_ack_seq);
- tcp_tap_window_update(conn, max_ack_seq_wnd);
+ tcp_tap_window_update(c, conn, max_ack_seq_wnd);
if (retr) {
flow_trace(conn,
@@ -1861,7 +1868,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
const struct tcphdr *th,
const char *opts, size_t optlen)
{
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_get_tap_ws(conn, opts, optlen);
/* First value is not scaled */
@@ -2059,7 +2066,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (!th->ack)
goto reset;
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
@@ -2071,7 +2078,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (conn->events & TAP_FIN_RCVD) {
tcp_sock_consume(conn, ntohl(th->ack_seq));
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
if (conn->events & SOCK_FIN_RCVD &&
--
@@ -1257,19 +1257,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
/**
* tcp_tap_window_update() - Process an updated window from tap side
+ * @c: Execution context
* @conn: Connection pointer
* @wnd: Window value, host order, unscaled
*/
-static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
+static void tcp_tap_window_update(const struct ctx *c,
+ struct tcp_tap_conn *conn, unsigned wnd)
{
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
/* Work-around for bug introduced in peer kernel code, commit
- * e2142825c120 ("net: tcp: send zero-window ACK when no memory").
- * We don't update if window shrank to zero.
+ * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't
+ * update the window if it shrank to zero, so that we'll eventually
+ * retry to send data, but rewind the sequence as that obviously implies
+ * that no data beyond the updated window will ever be acknowledged.
*/
- if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap))
+ if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
+ tcp_rewind_seq(c, conn);
return;
+ }
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
@@ -1694,7 +1700,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
tcp_timer_ctl(c, conn);
if (p->count == 1) {
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn,
+ ntohs(th->window));
return 1;
}
@@ -1772,7 +1779,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (ack && !tcp_sock_consume(conn, max_ack_seq))
tcp_update_seqack_from_tap(c, conn, max_ack_seq);
- tcp_tap_window_update(conn, max_ack_seq_wnd);
+ tcp_tap_window_update(c, conn, max_ack_seq_wnd);
if (retr) {
flow_trace(conn,
@@ -1861,7 +1868,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
const struct tcphdr *th,
const char *opts, size_t optlen)
{
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_get_tap_ws(conn, opts, optlen);
/* First value is not scaled */
@@ -2059,7 +2066,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (!th->ack)
goto reset;
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
@@ -2071,7 +2078,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (conn->events & TAP_FIN_RCVD) {
tcp_sock_consume(conn, ntohl(th->ack_seq));
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- tcp_tap_window_update(conn, ntohs(th->window));
+ tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
if (conn->events & SOCK_FIN_RCVD &&
--
2.43.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 4/6] tcp: Fix closing logic for half-closed connections
2025-08-15 16:10 [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (2 preceding siblings ...)
2025-08-15 16:10 ` [PATCH 3/6] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
@ 2025-08-15 16:10 ` Stefano Brivio
2025-08-18 2:52 ` David Gibson
2025-08-15 16:10 ` [PATCH 5/6] tcp: Don't try to transmit right after the peer shrank the window to zero Stefano Brivio
` (2 subsequent siblings)
6 siblings, 1 reply; 17+ messages in thread
From: Stefano Brivio @ 2025-08-15 16:10 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger
First off, don't close connections half-closed by the guest before
our own FIN is acknowledged by the guest itself.
That is, after we receive a FIN from the guest (TAP_FIN_RCVD), if we
don't have any data left to send from the socket (SOCK_FIN_RCVD, or
EPOLLHUP), we send a FIN segment to the guest (TAP_FIN_SENT), but we
need to actually have it acknowledged (and have no pending
retransmissions) before we can close the connection: check for
TAP_FIN_ACKED, first.
Then, if we set TAP_FIN_SENT, and we receive an ACK segment from the
guest, set TAP_FIN_ACKED. This was entirely missing for the
TAP_FIN_RCVD case, and as we fix the problem described above, this
becomes relevant as well.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/tcp.c b/tcp.c
index dda0a2e..aed25a9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2081,9 +2081,14 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
- if (conn->events & SOCK_FIN_RCVD &&
- conn->seq_ack_from_tap == conn->seq_to_tap)
- conn_event(c, conn, CLOSED);
+ if (conn->seq_ack_from_tap == conn->seq_to_tap) {
+ if (th->ack && conn->events & TAP_FIN_SENT)
+ conn_event(c, conn, TAP_FIN_ACKED);
+
+ if (conn->events & SOCK_FIN_RCVD &&
+ conn->events & TAP_FIN_ACKED)
+ conn_event(c, conn, CLOSED);
+ }
return 1;
}
@@ -2363,7 +2368,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
return;
}
- if ((conn->events & TAP_FIN_SENT) && (events & EPOLLHUP)) {
+ if ((conn->events & TAP_FIN_ACKED) && (events & EPOLLHUP)) {
conn_event(c, conn, CLOSED);
return;
}
--
@@ -2081,9 +2081,14 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
tcp_tap_window_update(c, conn, ntohs(th->window));
tcp_data_from_sock(c, conn);
- if (conn->events & SOCK_FIN_RCVD &&
- conn->seq_ack_from_tap == conn->seq_to_tap)
- conn_event(c, conn, CLOSED);
+ if (conn->seq_ack_from_tap == conn->seq_to_tap) {
+ if (th->ack && conn->events & TAP_FIN_SENT)
+ conn_event(c, conn, TAP_FIN_ACKED);
+
+ if (conn->events & SOCK_FIN_RCVD &&
+ conn->events & TAP_FIN_ACKED)
+ conn_event(c, conn, CLOSED);
+ }
return 1;
}
@@ -2363,7 +2368,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
return;
}
- if ((conn->events & TAP_FIN_SENT) && (events & EPOLLHUP)) {
+ if ((conn->events & TAP_FIN_ACKED) && (events & EPOLLHUP)) {
conn_event(c, conn, CLOSED);
return;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 5/6] tcp: Don't try to transmit right after the peer shrank the window to zero
2025-08-15 16:10 [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (3 preceding siblings ...)
2025-08-15 16:10 ` [PATCH 4/6] tcp: Fix closing logic for half-closed connections Stefano Brivio
@ 2025-08-15 16:10 ` Stefano Brivio
2025-08-18 2:56 ` David Gibson
2025-08-15 16:10 ` [PATCH 6/6] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
2025-08-18 17:40 ` [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Paul Holzinger
6 siblings, 1 reply; 17+ messages in thread
From: Stefano Brivio @ 2025-08-15 16:10 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger
If the peer shrinks the window to zero, we'll skip storing the new
window, as a convenient way to cause window probes (which exceed any
zero-sized window, strictly speaking) if we don't get window updates
in a while.
As we do so, though, we need to ensure we don't try to queue more data
from the socket right after we process this window update, as the
entire point of a zero-window advertisement is to keep us from sending
more data.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/tcp.c b/tcp.c
index aed25a9..624e7f4 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1260,8 +1260,10 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
* @c: Execution context
* @conn: Connection pointer
* @wnd: Window value, host order, unscaled
+ *
+ * Return: false on zero window (not stored to wnd_from_tap), true otherwise
*/
-static void tcp_tap_window_update(const struct ctx *c,
+static bool tcp_tap_window_update(const struct ctx *c,
struct tcp_tap_conn *conn, unsigned wnd)
{
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
@@ -1274,13 +1276,14 @@ static void tcp_tap_window_update(const struct ctx *c,
*/
if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
tcp_rewind_seq(c, conn);
- return;
+ return false;
}
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
/* FIXME: reflect the tap-side receiver's window back to the sock-side
* sender by adjusting SO_RCVBUF? */
+ return true;
}
/**
@@ -2066,9 +2069,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (!th->ack)
goto reset;
- tcp_tap_window_update(c, conn, ntohs(th->window));
-
- tcp_data_from_sock(c, conn);
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ tcp_data_from_sock(c, conn);
if (p->count - idx == 1)
return 1;
@@ -2078,8 +2080,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (conn->events & TAP_FIN_RCVD) {
tcp_sock_consume(conn, ntohl(th->ack_seq));
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- tcp_tap_window_update(c, conn, ntohs(th->window));
- tcp_data_from_sock(c, conn);
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ tcp_data_from_sock(c, conn);
if (conn->seq_ack_from_tap == conn->seq_to_tap) {
if (th->ack && conn->events & TAP_FIN_SENT)
--
@@ -1260,8 +1260,10 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
* @c: Execution context
* @conn: Connection pointer
* @wnd: Window value, host order, unscaled
+ *
+ * Return: false on zero window (not stored to wnd_from_tap), true otherwise
*/
-static void tcp_tap_window_update(const struct ctx *c,
+static bool tcp_tap_window_update(const struct ctx *c,
struct tcp_tap_conn *conn, unsigned wnd)
{
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
@@ -1274,13 +1276,14 @@ static void tcp_tap_window_update(const struct ctx *c,
*/
if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
tcp_rewind_seq(c, conn);
- return;
+ return false;
}
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
/* FIXME: reflect the tap-side receiver's window back to the sock-side
* sender by adjusting SO_RCVBUF? */
+ return true;
}
/**
@@ -2066,9 +2069,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (!th->ack)
goto reset;
- tcp_tap_window_update(c, conn, ntohs(th->window));
-
- tcp_data_from_sock(c, conn);
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ tcp_data_from_sock(c, conn);
if (p->count - idx == 1)
return 1;
@@ -2078,8 +2080,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
if (conn->events & TAP_FIN_RCVD) {
tcp_sock_consume(conn, ntohl(th->ack_seq));
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- tcp_tap_window_update(c, conn, ntohs(th->window));
- tcp_data_from_sock(c, conn);
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ tcp_data_from_sock(c, conn);
if (conn->seq_ack_from_tap == conn->seq_to_tap) {
if (th->ack && conn->events & TAP_FIN_SENT)
--
2.43.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH 6/6] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent
2025-08-15 16:10 [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (4 preceding siblings ...)
2025-08-15 16:10 ` [PATCH 5/6] tcp: Don't try to transmit right after the peer shrank the window to zero Stefano Brivio
@ 2025-08-15 16:10 ` Stefano Brivio
2025-08-18 3:04 ` David Gibson
2025-08-18 17:40 ` [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Paul Holzinger
6 siblings, 1 reply; 17+ messages in thread
From: Stefano Brivio @ 2025-08-15 16:10 UTC (permalink / raw)
To: passt-dev; +Cc: Jon Maloy, Paul Holzinger
We currently have a number of discrepancies in the tcp_tap_handler()
path between the half-closed connection path and the regular one, and
they are mostly a result of code duplication, which comes in turn from
the fact that tcp_data_from_tap() deals with data transfers as well as
general connection bookkeeping, so we can't use it for half-closed
connections.
This suggests that we should probably rework it into two or more
functions, in the long term, but for the moment being I'm just fixing
one obvious issue, which is the lack of fast retransmissions in the
TAP_FIN_RCVD path, and a potential one, which is the fact we don't
handle socket flush failures.
Add fast re-transmit for half-closed connections, and extract the
logic to determine the TCP payload length from tcp_data_from_tap()
into the new tcp_packet_data_len() helper to decrease a bit the amount
of resulting code duplication.
Handle the case of socket flush (tcp_sock_consume()) flush failure in
the same way as tcp_data_from_tap() handles it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 79 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 56 insertions(+), 23 deletions(-)
diff --git a/tcp.c b/tcp.c
index 624e7f4..163f820 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1640,6 +1640,22 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
return tcp_buf_data_from_sock(c, conn);
}
+/**
+ * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet
+ * @th: Pointer to TCP header
+ * @l4len: TCP packet length, including TCP header
+ *
+ * Return: data length of TCP packet, -1 on invalid value of Data Offset field
+ */
+static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len)
+{
+ size_t off = th->doff * 4UL;
+
+ if (off < sizeof(*th) || off > l4len)
+ return -1;
+
+ return l4len - off;
+}
/**
* tcp_data_from_tap() - tap/guest data for established connection
@@ -1671,27 +1687,21 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
for (i = idx, iov_i = 0; i < (int)p->count; i++) {
uint32_t seq, seq_offset, ack_seq;
const struct tcphdr *th;
+ ssize_t dlen;
char *data;
- size_t off;
th = packet_get(p, i, 0, sizeof(*th), &len);
if (!th)
return -1;
len += sizeof(*th);
- off = th->doff * 4UL;
- if (off < sizeof(*th) || off > len)
- return -1;
-
if (th->rst) {
conn_event(c, conn, CLOSED);
return 1;
}
- len -= off;
- data = packet_get(p, i, off, len, NULL);
- if (!data)
- continue;
+ if ((dlen = tcp_packet_data_len(th, len)) < 0)
+ return -1;
seq = ntohl(th->seq);
if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) {
@@ -1719,7 +1729,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) &&
SEQ_GE(ack_seq, max_ack_seq)) {
/* Fast re-transmit */
- retr = !len && !th->fin &&
+ retr = !dlen && !th->fin &&
ack_seq == max_ack_seq &&
ntohs(th->window) == max_ack_seq_wnd;
@@ -1731,33 +1741,37 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (th->fin)
fin = 1;
- if (!len)
+ if (!dlen)
+ continue;
+
+ data = packet_get(p, i, th->doff * 4UL, dlen, NULL);
+ if (!data)
continue;
seq_offset = seq_from_tap - seq;
/* Use data from this buffer only in these two cases:
*
- * , seq_from_tap , seq_from_tap
- * |--------| <-- len |--------| <-- len
+ * , seq_from_tap , seq_from_tap
+ * |--------| <-- dlen |--------| <-- dlen
* '----' <-- offset ' <-- offset
* ^ seq ^ seq
- * (offset >= 0, seq + len > seq_from_tap)
+ * (offset >= 0, seq + dlen > seq_from_tap)
*
* discard in these two cases:
- * , seq_from_tap , seq_from_tap
- * |--------| <-- len |--------| <-- len
+ * , seq_from_tap , seq_from_tap
+ * |--------| <-- dlen |--------| <-- dlen
* '--------' <-- offset '-----| <- offset
* ^ seq ^ seq
- * (offset >= 0, seq + len <= seq_from_tap)
+ * (offset >= 0, seq + dlen <= seq_from_tap)
*
* keep, look for another buffer, then go back, in this case:
* , seq_from_tap
- * |--------| <-- len
+ * |--------| <-- dlen
* '===' <-- offset
* ^ seq
* (offset < 0)
*/
- if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + len, seq_from_tap))
+ if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + dlen, seq_from_tap))
continue;
if (SEQ_LT(seq_offset, 0)) {
@@ -1767,7 +1781,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
}
tcp_iov[iov_i].iov_base = data + seq_offset;
- tcp_iov[iov_i].iov_len = len - seq_offset;
+ tcp_iov[iov_i].iov_len = dlen - seq_offset;
seq_from_tap += tcp_iov[iov_i].iov_len;
iov_i++;
@@ -2078,9 +2092,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
/* Established connections not accepting data from tap */
if (conn->events & TAP_FIN_RCVD) {
- tcp_sock_consume(conn, ntohl(th->ack_seq));
- tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ bool retr;
+
+ retr = th->ack && !tcp_packet_data_len(th, len) && !th->fin &&
+ ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
+ ntohs(th->window) == conn->wnd_from_tap;
+
+ /* On socket flush failure, pretend there was no ACK, try again
+ * later
+ */
+ if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq)))
+ tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
+
+ if (retr) {
+ flow_trace(conn,
+ "fast re-transmit, ACK: %u, previous sequence: %u",
+ ntohl(th->ack_seq), conn->seq_to_tap);
+
+ if (tcp_rewind_seq(c, conn))
+ return -1;
+ }
+
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr)
tcp_data_from_sock(c, conn);
if (conn->seq_ack_from_tap == conn->seq_to_tap) {
--
@@ -1640,6 +1640,22 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
return tcp_buf_data_from_sock(c, conn);
}
+/**
+ * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet
+ * @th: Pointer to TCP header
+ * @l4len: TCP packet length, including TCP header
+ *
+ * Return: data length of TCP packet, -1 on invalid value of Data Offset field
+ */
+static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len)
+{
+ size_t off = th->doff * 4UL;
+
+ if (off < sizeof(*th) || off > l4len)
+ return -1;
+
+ return l4len - off;
+}
/**
* tcp_data_from_tap() - tap/guest data for established connection
@@ -1671,27 +1687,21 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
for (i = idx, iov_i = 0; i < (int)p->count; i++) {
uint32_t seq, seq_offset, ack_seq;
const struct tcphdr *th;
+ ssize_t dlen;
char *data;
- size_t off;
th = packet_get(p, i, 0, sizeof(*th), &len);
if (!th)
return -1;
len += sizeof(*th);
- off = th->doff * 4UL;
- if (off < sizeof(*th) || off > len)
- return -1;
-
if (th->rst) {
conn_event(c, conn, CLOSED);
return 1;
}
- len -= off;
- data = packet_get(p, i, off, len, NULL);
- if (!data)
- continue;
+ if ((dlen = tcp_packet_data_len(th, len)) < 0)
+ return -1;
seq = ntohl(th->seq);
if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) {
@@ -1719,7 +1729,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) &&
SEQ_GE(ack_seq, max_ack_seq)) {
/* Fast re-transmit */
- retr = !len && !th->fin &&
+ retr = !dlen && !th->fin &&
ack_seq == max_ack_seq &&
ntohs(th->window) == max_ack_seq_wnd;
@@ -1731,33 +1741,37 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
if (th->fin)
fin = 1;
- if (!len)
+ if (!dlen)
+ continue;
+
+ data = packet_get(p, i, th->doff * 4UL, dlen, NULL);
+ if (!data)
continue;
seq_offset = seq_from_tap - seq;
/* Use data from this buffer only in these two cases:
*
- * , seq_from_tap , seq_from_tap
- * |--------| <-- len |--------| <-- len
+ * , seq_from_tap , seq_from_tap
+ * |--------| <-- dlen |--------| <-- dlen
* '----' <-- offset ' <-- offset
* ^ seq ^ seq
- * (offset >= 0, seq + len > seq_from_tap)
+ * (offset >= 0, seq + dlen > seq_from_tap)
*
* discard in these two cases:
- * , seq_from_tap , seq_from_tap
- * |--------| <-- len |--------| <-- len
+ * , seq_from_tap , seq_from_tap
+ * |--------| <-- dlen |--------| <-- dlen
* '--------' <-- offset '-----| <- offset
* ^ seq ^ seq
- * (offset >= 0, seq + len <= seq_from_tap)
+ * (offset >= 0, seq + dlen <= seq_from_tap)
*
* keep, look for another buffer, then go back, in this case:
* , seq_from_tap
- * |--------| <-- len
+ * |--------| <-- dlen
* '===' <-- offset
* ^ seq
* (offset < 0)
*/
- if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + len, seq_from_tap))
+ if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + dlen, seq_from_tap))
continue;
if (SEQ_LT(seq_offset, 0)) {
@@ -1767,7 +1781,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
}
tcp_iov[iov_i].iov_base = data + seq_offset;
- tcp_iov[iov_i].iov_len = len - seq_offset;
+ tcp_iov[iov_i].iov_len = dlen - seq_offset;
seq_from_tap += tcp_iov[iov_i].iov_len;
iov_i++;
@@ -2078,9 +2092,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
/* Established connections not accepting data from tap */
if (conn->events & TAP_FIN_RCVD) {
- tcp_sock_consume(conn, ntohl(th->ack_seq));
- tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
- if (tcp_tap_window_update(c, conn, ntohs(th->window)))
+ bool retr;
+
+ retr = th->ack && !tcp_packet_data_len(th, len) && !th->fin &&
+ ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
+ ntohs(th->window) == conn->wnd_from_tap;
+
+ /* On socket flush failure, pretend there was no ACK, try again
+ * later
+ */
+ if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq)))
+ tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
+
+ if (retr) {
+ flow_trace(conn,
+ "fast re-transmit, ACK: %u, previous sequence: %u",
+ ntohl(th->ack_seq), conn->seq_to_tap);
+
+ if (tcp_rewind_seq(c, conn))
+ return -1;
+ }
+
+ if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr)
tcp_data_from_sock(c, conn);
if (conn->seq_ack_from_tap == conn->seq_to_tap) {
--
2.43.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH 1/6] tcp: FIN flags have to be retransmitted as well
2025-08-15 16:10 ` [PATCH 1/6] tcp: FIN flags have to be retransmitted as well Stefano Brivio
@ 2025-08-18 2:43 ` David Gibson
0 siblings, 0 replies; 17+ messages in thread
From: David Gibson @ 2025-08-18 2:43 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 1712 bytes --]
On Fri, Aug 15, 2025 at 06:10:37PM +0200, Stefano Brivio wrote:
> If we're retransmitting any data, and we sent a FIN segment to our
> peer, regardless of whether it was received, we obviously have to
> retransmit it as well, given that it can only come with the last data
> segment, or after it.
>
> Unconditionally clear the internal TAP_FIN_SENT flag whenever we
> re-transmit, so that we know we have to send it again, in case.
>
> Reported-by: Paul Holzinger <pholzing@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/tcp.c b/tcp.c
> index 957b498..7c1f237 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1759,6 +1759,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> "fast re-transmit, ACK: %u, previous sequence: %u",
> max_ack_seq, conn->seq_to_tap);
> conn->seq_to_tap = max_ack_seq;
> + conn->events &= ~TAP_FIN_SENT;
> if (tcp_set_peek_offset(conn, 0)) {
> tcp_rst(c, conn);
> return -1;
> @@ -2286,6 +2287,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> flow_dbg(conn, "ACK timeout, retry");
> conn->retrans++;
> conn->seq_to_tap = conn->seq_ack_from_tap;
> + conn->events &= ~TAP_FIN_SENT;
> if (!conn->wnd_from_tap)
> conn->wnd_from_tap = 1; /* Zero-window probe */
> if (tcp_set_peek_offset(conn, 0)) {
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 2/6] tcp: Factor sequence rewind for retransmissions into a new function
2025-08-15 16:10 ` [PATCH 2/6] tcp: Factor sequence rewind for retransmissions into a new function Stefano Brivio
@ 2025-08-18 2:46 ` David Gibson
0 siblings, 0 replies; 17+ messages in thread
From: David Gibson @ 2025-08-18 2:46 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 3240 bytes --]
On Fri, Aug 15, 2025 at 06:10:38PM +0200, Stefano Brivio wrote:
> ...as I'm going to need a third occurrence of this in the next change.
>
> This introduces a small functional change in tcp_data_from_tap(): the
> sequence was previously rewound to the highest ACK number we found in
> the current packet batch, and not to the current value of
> seq_ack_from_tap.
>
> The two might differ in case tcp_sock_consume() failed, because in
> that case we're ignoring that ACK altogether. But if we're ignoring
> it, it looks more correct to me to start retransmitting from an
> earlier sequence anyway.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 47 ++++++++++++++++++++++++++++++++---------------
> 1 file changed, 32 insertions(+), 15 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 7c1f237..1402ca2 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1097,6 +1097,26 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
> }
> }
>
> +/**
> + * tcp_rewind_seq() - Rewind sequence to tap and socket offset to current ACK
> + * @c: Execution context
> + * @conn: Connection pointer
> + *
> + * Return: 0 on success, -1 on failure, with connection reset
> + */
> +static int tcp_rewind_seq(const struct ctx *c, struct tcp_tap_conn *conn)
> +{
> + conn->seq_to_tap = conn->seq_ack_from_tap;
> + conn->events &= ~TAP_FIN_SENT;
> +
> + if (tcp_set_peek_offset(conn, 0)) {
> + tcp_rst(c, conn);
> + return -1;
> + }
> +
> + return 0;
> +}
> +
> /**
> * tcp_prepare_flags() - Prepare header for flags-only segment (no payload)
> * @c: Execution context
> @@ -1757,13 +1777,11 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> if (retr) {
> flow_trace(conn,
> "fast re-transmit, ACK: %u, previous sequence: %u",
> - max_ack_seq, conn->seq_to_tap);
> - conn->seq_to_tap = max_ack_seq;
> - conn->events &= ~TAP_FIN_SENT;
> - if (tcp_set_peek_offset(conn, 0)) {
> - tcp_rst(c, conn);
> + conn->seq_ack_from_tap, conn->seq_to_tap);
> +
> + if (tcp_rewind_seq(c, conn))
> return -1;
> - }
> +
> tcp_data_from_sock(c, conn);
> }
>
> @@ -2285,17 +2303,16 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> tcp_rst(c, conn);
> } else {
> flow_dbg(conn, "ACK timeout, retry");
> - conn->retrans++;
> - conn->seq_to_tap = conn->seq_ack_from_tap;
> - conn->events &= ~TAP_FIN_SENT;
> +
> if (!conn->wnd_from_tap)
> conn->wnd_from_tap = 1; /* Zero-window probe */
> - if (tcp_set_peek_offset(conn, 0)) {
> - tcp_rst(c, conn);
> - } else {
> - tcp_data_from_sock(c, conn);
> - tcp_timer_ctl(c, conn);
> - }
> +
> + conn->retrans++;
> + if (tcp_rewind_seq(c, conn))
> + return;
> +
> + tcp_data_from_sock(c, conn);
> + tcp_timer_ctl(c, conn);
> }
> } else {
> struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } };
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 3/6] tcp: Rewind sequence when guest shrinks window to zero
2025-08-15 16:10 ` [PATCH 3/6] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
@ 2025-08-18 2:49 ` David Gibson
0 siblings, 0 replies; 17+ messages in thread
From: David Gibson @ 2025-08-18 2:49 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 3902 bytes --]
On Fri, Aug 15, 2025 at 06:10:39PM +0200, Stefano Brivio wrote:
> A window shrunk to zero means by definition that anything else that
> might be in flight is now out of window. Restart from the currently
> acknowledged sequence.
>
> Suggested-by: Jon Maloy <jmaloy@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 25 ++++++++++++++++---------
> 1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 1402ca2..dda0a2e 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1257,19 +1257,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
>
> /**
> * tcp_tap_window_update() - Process an updated window from tap side
> + * @c: Execution context
> * @conn: Connection pointer
> * @wnd: Window value, host order, unscaled
> */
> -static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> +static void tcp_tap_window_update(const struct ctx *c,
> + struct tcp_tap_conn *conn, unsigned wnd)
> {
> wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
>
> /* Work-around for bug introduced in peer kernel code, commit
> - * e2142825c120 ("net: tcp: send zero-window ACK when no memory").
> - * We don't update if window shrank to zero.
> + * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't
> + * update the window if it shrank to zero, so that we'll eventually
> + * retry to send data, but rewind the sequence as that obviously implies
> + * that no data beyond the updated window will ever be acknowledged.
> */
> - if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap))
> + if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
> + tcp_rewind_seq(c, conn);
> return;
> + }
>
> conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
>
> @@ -1694,7 +1700,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> tcp_timer_ctl(c, conn);
>
> if (p->count == 1) {
> - tcp_tap_window_update(conn, ntohs(th->window));
> + tcp_tap_window_update(c, conn,
> + ntohs(th->window));
> return 1;
> }
>
> @@ -1772,7 +1779,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> if (ack && !tcp_sock_consume(conn, max_ack_seq))
> tcp_update_seqack_from_tap(c, conn, max_ack_seq);
>
> - tcp_tap_window_update(conn, max_ack_seq_wnd);
> + tcp_tap_window_update(c, conn, max_ack_seq_wnd);
>
> if (retr) {
> flow_trace(conn,
> @@ -1861,7 +1868,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c,
> const struct tcphdr *th,
> const char *opts, size_t optlen)
> {
> - tcp_tap_window_update(conn, ntohs(th->window));
> + tcp_tap_window_update(c, conn, ntohs(th->window));
> tcp_get_tap_ws(conn, opts, optlen);
>
> /* First value is not scaled */
> @@ -2059,7 +2066,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> if (!th->ack)
> goto reset;
>
> - tcp_tap_window_update(conn, ntohs(th->window));
> + tcp_tap_window_update(c, conn, ntohs(th->window));
>
> tcp_data_from_sock(c, conn);
>
> @@ -2071,7 +2078,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> if (conn->events & TAP_FIN_RCVD) {
> tcp_sock_consume(conn, ntohl(th->ack_seq));
> tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> - tcp_tap_window_update(conn, ntohs(th->window));
> + tcp_tap_window_update(c, conn, ntohs(th->window));
> tcp_data_from_sock(c, conn);
>
> if (conn->events & SOCK_FIN_RCVD &&
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 4/6] tcp: Fix closing logic for half-closed connections
2025-08-15 16:10 ` [PATCH 4/6] tcp: Fix closing logic for half-closed connections Stefano Brivio
@ 2025-08-18 2:52 ` David Gibson
0 siblings, 0 replies; 17+ messages in thread
From: David Gibson @ 2025-08-18 2:52 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 2239 bytes --]
On Fri, Aug 15, 2025 at 06:10:40PM +0200, Stefano Brivio wrote:
> First off, don't close connections half-closed by the guest before
> our own FIN is acknowledged by the guest itself.
>
> That is, after we receive a FIN from the guest (TAP_FIN_RCVD), if we
> don't have any data left to send from the socket (SOCK_FIN_RCVD, or
> EPOLLHUP), we send a FIN segment to the guest (TAP_FIN_SENT), but we
> need to actually have it acknowledged (and have no pending
> retransmissions) before we can close the connection: check for
> TAP_FIN_ACKED, first.
>
> Then, if we set TAP_FIN_SENT, and we receive an ACK segment from the
> guest, set TAP_FIN_ACKED. This was entirely missing for the
> TAP_FIN_RCVD case, and as we fix the problem described above, this
> becomes relevant as well.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index dda0a2e..aed25a9 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2081,9 +2081,14 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> tcp_tap_window_update(c, conn, ntohs(th->window));
> tcp_data_from_sock(c, conn);
>
> - if (conn->events & SOCK_FIN_RCVD &&
> - conn->seq_ack_from_tap == conn->seq_to_tap)
> - conn_event(c, conn, CLOSED);
> + if (conn->seq_ack_from_tap == conn->seq_to_tap) {
> + if (th->ack && conn->events & TAP_FIN_SENT)
> + conn_event(c, conn, TAP_FIN_ACKED);
> +
> + if (conn->events & SOCK_FIN_RCVD &&
> + conn->events & TAP_FIN_ACKED)
> + conn_event(c, conn, CLOSED);
> + }
>
> return 1;
> }
> @@ -2363,7 +2368,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> return;
> }
>
> - if ((conn->events & TAP_FIN_SENT) && (events & EPOLLHUP)) {
> + if ((conn->events & TAP_FIN_ACKED) && (events & EPOLLHUP)) {
> conn_event(c, conn, CLOSED);
> return;
> }
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 5/6] tcp: Don't try to transmit right after the peer shrank the window to zero
2025-08-15 16:10 ` [PATCH 5/6] tcp: Don't try to transmit right after the peer shrank the window to zero Stefano Brivio
@ 2025-08-18 2:56 ` David Gibson
2025-08-18 17:06 ` Stefano Brivio
0 siblings, 1 reply; 17+ messages in thread
From: David Gibson @ 2025-08-18 2:56 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 3240 bytes --]
On Fri, Aug 15, 2025 at 06:10:41PM +0200, Stefano Brivio wrote:
> If the peer shrinks the window to zero, we'll skip storing the new
> window, as a convenient way to cause window probes (which exceed any
> zero-sized window, strictly speaking) if we don't get window updates
> in a while.
Strictly speaking, not storing the new zero window feels slightly
wrong to me - I wonder if it would be more correct to store the zero
window, but still send window probes as a special case.
> As we do so, though, we need to ensure we don't try to queue more data
> from the socket right after we process this window update, as the
> entire point of a zero-window advertisement is to keep us from sending
> more data.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For the meantime, though, I'm reasonably confident that this is still
an improvement, so,
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index aed25a9..624e7f4 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1260,8 +1260,10 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
> * @c: Execution context
> * @conn: Connection pointer
> * @wnd: Window value, host order, unscaled
> + *
> + * Return: false on zero window (not stored to wnd_from_tap), true otherwise
> */
> -static void tcp_tap_window_update(const struct ctx *c,
> +static bool tcp_tap_window_update(const struct ctx *c,
> struct tcp_tap_conn *conn, unsigned wnd)
> {
> wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> @@ -1274,13 +1276,14 @@ static void tcp_tap_window_update(const struct ctx *c,
> */
> if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) {
> tcp_rewind_seq(c, conn);
> - return;
> + return false;
> }
>
> conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
>
> /* FIXME: reflect the tap-side receiver's window back to the sock-side
> * sender by adjusting SO_RCVBUF? */
> + return true;
> }
>
> /**
> @@ -2066,9 +2069,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> if (!th->ack)
> goto reset;
>
> - tcp_tap_window_update(c, conn, ntohs(th->window));
> -
> - tcp_data_from_sock(c, conn);
> + if (tcp_tap_window_update(c, conn, ntohs(th->window)))
> + tcp_data_from_sock(c, conn);
>
> if (p->count - idx == 1)
> return 1;
> @@ -2078,8 +2080,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> if (conn->events & TAP_FIN_RCVD) {
> tcp_sock_consume(conn, ntohl(th->ack_seq));
> tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> - tcp_tap_window_update(c, conn, ntohs(th->window));
> - tcp_data_from_sock(c, conn);
> + if (tcp_tap_window_update(c, conn, ntohs(th->window)))
> + tcp_data_from_sock(c, conn);
>
> if (conn->seq_ack_from_tap == conn->seq_to_tap) {
> if (th->ack && conn->events & TAP_FIN_SENT)
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 6/6] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent
2025-08-15 16:10 ` [PATCH 6/6] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
@ 2025-08-18 3:04 ` David Gibson
0 siblings, 0 replies; 17+ messages in thread
From: David Gibson @ 2025-08-18 3:04 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 7073 bytes --]
On Fri, Aug 15, 2025 at 06:10:42PM +0200, Stefano Brivio wrote:
> We currently have a number of discrepancies in the tcp_tap_handler()
> path between the half-closed connection path and the regular one, and
> they are mostly a result of code duplication, which comes in turn from
> the fact that tcp_data_from_tap() deals with data transfers as well as
> general connection bookkeeping, so we can't use it for half-closed
> connections.
>
> This suggests that we should probably rework it into two or more
> functions, in the long term,
Agreed.
> but for the moment being I'm just fixing
> one obvious issue, which is the lack of fast retransmissions in the
> TAP_FIN_RCVD path, and a potential one, which is the fact we don't
> handle socket flush failures.
Fair enough for the time being.
> Add fast re-transmit for half-closed connections, and extract the
> logic to determine the TCP payload length from tcp_data_from_tap()
> into the new tcp_packet_data_len() helper to decrease a bit the amount
> of resulting code duplication.
>
> Handle the case of socket flush (tcp_sock_consume()) flush failure in
> the same way as tcp_data_from_tap() handles it.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 79 ++++++++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 56 insertions(+), 23 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 624e7f4..163f820 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1640,6 +1640,22 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>
> return tcp_buf_data_from_sock(c, conn);
> }
> +/**
> + * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet
> + * @th: Pointer to TCP header
> + * @l4len: TCP packet length, including TCP header
> + *
> + * Return: data length of TCP packet, -1 on invalid value of Data Offset field
> + */
> +static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len)
> +{
> + size_t off = th->doff * 4UL;
> +
> + if (off < sizeof(*th) || off > l4len)
> + return -1;
> +
> + return l4len - off;
> +}
>
> /**
> * tcp_data_from_tap() - tap/guest data for established connection
> @@ -1671,27 +1687,21 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> for (i = idx, iov_i = 0; i < (int)p->count; i++) {
> uint32_t seq, seq_offset, ack_seq;
> const struct tcphdr *th;
> + ssize_t dlen;
> char *data;
> - size_t off;
>
> th = packet_get(p, i, 0, sizeof(*th), &len);
> if (!th)
> return -1;
> len += sizeof(*th);
>
> - off = th->doff * 4UL;
> - if (off < sizeof(*th) || off > len)
> - return -1;
> -
> if (th->rst) {
> conn_event(c, conn, CLOSED);
> return 1;
> }
>
> - len -= off;
> - data = packet_get(p, i, off, len, NULL);
> - if (!data)
> - continue;
> + if ((dlen = tcp_packet_data_len(th, len)) < 0)
> + return -1;
>
> seq = ntohl(th->seq);
> if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) {
> @@ -1719,7 +1729,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) &&
> SEQ_GE(ack_seq, max_ack_seq)) {
> /* Fast re-transmit */
> - retr = !len && !th->fin &&
> + retr = !dlen && !th->fin &&
> ack_seq == max_ack_seq &&
> ntohs(th->window) == max_ack_seq_wnd;
>
> @@ -1731,33 +1741,37 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> if (th->fin)
> fin = 1;
>
> - if (!len)
> + if (!dlen)
> + continue;
> +
> + data = packet_get(p, i, th->doff * 4UL, dlen, NULL);
> + if (!data)
> continue;
>
> seq_offset = seq_from_tap - seq;
> /* Use data from this buffer only in these two cases:
> *
> - * , seq_from_tap , seq_from_tap
> - * |--------| <-- len |--------| <-- len
> + * , seq_from_tap , seq_from_tap
> + * |--------| <-- dlen |--------| <-- dlen
> * '----' <-- offset ' <-- offset
> * ^ seq ^ seq
> - * (offset >= 0, seq + len > seq_from_tap)
> + * (offset >= 0, seq + dlen > seq_from_tap)
> *
> * discard in these two cases:
> - * , seq_from_tap , seq_from_tap
> - * |--------| <-- len |--------| <-- len
> + * , seq_from_tap , seq_from_tap
> + * |--------| <-- dlen |--------| <-- dlen
> * '--------' <-- offset '-----| <- offset
> * ^ seq ^ seq
> - * (offset >= 0, seq + len <= seq_from_tap)
> + * (offset >= 0, seq + dlen <= seq_from_tap)
> *
> * keep, look for another buffer, then go back, in this case:
> * , seq_from_tap
> - * |--------| <-- len
> + * |--------| <-- dlen
> * '===' <-- offset
> * ^ seq
> * (offset < 0)
> */
> - if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + len, seq_from_tap))
> + if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + dlen, seq_from_tap))
> continue;
>
> if (SEQ_LT(seq_offset, 0)) {
> @@ -1767,7 +1781,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> }
>
> tcp_iov[iov_i].iov_base = data + seq_offset;
> - tcp_iov[iov_i].iov_len = len - seq_offset;
> + tcp_iov[iov_i].iov_len = dlen - seq_offset;
> seq_from_tap += tcp_iov[iov_i].iov_len;
> iov_i++;
>
> @@ -2078,9 +2092,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
>
> /* Established connections not accepting data from tap */
> if (conn->events & TAP_FIN_RCVD) {
> - tcp_sock_consume(conn, ntohl(th->ack_seq));
> - tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> - if (tcp_tap_window_update(c, conn, ntohs(th->window)))
> + bool retr;
> +
> + retr = th->ack && !tcp_packet_data_len(th, len) && !th->fin &&
> + ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
> + ntohs(th->window) == conn->wnd_from_tap;
> +
> + /* On socket flush failure, pretend there was no ACK, try again
> + * later
> + */
> + if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq)))
> + tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
> +
> + if (retr) {
> + flow_trace(conn,
> + "fast re-transmit, ACK: %u, previous sequence: %u",
> + ntohl(th->ack_seq), conn->seq_to_tap);
> +
> + if (tcp_rewind_seq(c, conn))
> + return -1;
> + }
> +
> + if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr)
> tcp_data_from_sock(c, conn);
>
> if (conn->seq_ack_from_tap == conn->seq_to_tap) {
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 5/6] tcp: Don't try to transmit right after the peer shrank the window to zero
2025-08-18 2:56 ` David Gibson
@ 2025-08-18 17:06 ` Stefano Brivio
2025-08-19 0:55 ` David Gibson
0 siblings, 1 reply; 17+ messages in thread
From: Stefano Brivio @ 2025-08-18 17:06 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Jon Maloy, Paul Holzinger
On Mon, 18 Aug 2025 12:56:09 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Fri, Aug 15, 2025 at 06:10:41PM +0200, Stefano Brivio wrote:
> > If the peer shrinks the window to zero, we'll skip storing the new
> > window, as a convenient way to cause window probes (which exceed any
> > zero-sized window, strictly speaking) if we don't get window updates
> > in a while.
>
> Strictly speaking, not storing the new zero window feels slightly
> wrong to me - I wonder if it would be more correct to store the zero
> window, but still send window probes as a special case.
Yeah, it's wrong but it's not really obvious what value we should use
at that point. With a recent kernel version, there would be no problem
storing 0, waiting for a window update which will come soon after, and
reserve the special case I already added (1 as window size) for
zero-window probes, triggered via timer.
But without 8c670bdfa58e ("tcp: correct handling of extreme memory
squeeze"), and with e2142825c120 ("net: tcp: send zero-window ACK when
no memory"), the window update never comes, so we would need to wait
for the timer to expire before we send an explicit zero-window probe,
which takes a while (two seconds).
With this trick, instead, we keep the latest advertised value, which
is actually the latest correct value, and retry sending what we were
originally sending as soon as we have more data from the socket, which
would becomes available quite soon in all the practical cases where the
kernel shrinks the window to zero.
Maybe we could store zero, and use WINDOW_DEFAULT (14600 bytes) on
socket activity. The RFC simply says we need to send zero-window
probes at some point, not that we should assume a broken receiver,
but unfortunately between e2142825c120 and 8c670bdfa58e it's like that.
--
Stefano
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels
2025-08-15 16:10 [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
` (5 preceding siblings ...)
2025-08-15 16:10 ` [PATCH 6/6] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
@ 2025-08-18 17:40 ` Paul Holzinger
2025-08-18 21:58 ` Stefano Brivio
6 siblings, 1 reply; 17+ messages in thread
From: Paul Holzinger @ 2025-08-18 17:40 UTC (permalink / raw)
To: Stefano Brivio, passt-dev; +Cc: Jon Maloy
Hi,
On 15/08/2025 18:10, Stefano Brivio wrote:
> Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger
> sk_rcvbuf checks"), window limits are enforced more aggressively with
> a bigger amount of zero-window updates compared to what happened with
> e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone,
> and occasional duplicate ACKs can now be seen also for local transfers
> with default (208 KiB) socket buffer sizes.
>
> Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the
> pasta integration occasionally fail on the "TCP/IPv4 large transfer,
> tap" case.
>
> While playing with a reproducer that seems to be matching those
> failures:
>
> while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
>
> and a kernel including that commit, I hit a few different failures,
> that should be fixed by this series.
>
> Stefano Brivio (6):
> tcp: FIN flags have to be retransmitted as well
> tcp: Factor sequence rewind for retransmissions into a new function
> tcp: Rewind sequence when guest shrinks window to zero
> tcp: Fix closing logic for half-closed connections
> tcp: Don't try to transmit right after the peer shrank the window to
> zero
> tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path
> consistent
>
> tcp.c | 170 ++++++++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 118 insertions(+), 52 deletions(-)
I applied the series in my test VM and run the reproducer from the
podman tests where I noticed the flake.
It failed after about 10mins and then when I enabled pasta trace logs
and tcpdump captures then it took 208 mins for me to reproduce so it
doesn't seem to fix the issue I am seeing.
I did send the pasta log/pcap files in private to Stefano for further
debugging.
--
Paul Holzinger
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels
2025-08-18 17:40 ` [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Paul Holzinger
@ 2025-08-18 21:58 ` Stefano Brivio
0 siblings, 0 replies; 17+ messages in thread
From: Stefano Brivio @ 2025-08-18 21:58 UTC (permalink / raw)
To: Paul Holzinger; +Cc: passt-dev, Jon Maloy
[-- Attachment #1: Type: text/plain, Size: 7827 bytes --]
On Mon, 18 Aug 2025 19:40:39 +0200
Paul Holzinger <pholzing@redhat.com> wrote:
> Hi,
>
> On 15/08/2025 18:10, Stefano Brivio wrote:
> > Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger
> > sk_rcvbuf checks"), window limits are enforced more aggressively with
> > a bigger amount of zero-window updates compared to what happened with
> > e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone,
> > and occasional duplicate ACKs can now be seen also for local transfers
> > with default (208 KiB) socket buffer sizes.
> >
> > Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the
> > pasta integration occasionally fail on the "TCP/IPv4 large transfer,
> > tap" case.
> >
> > While playing with a reproducer that seems to be matching those
> > failures:
> >
> > while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
> >
> > and a kernel including that commit, I hit a few different failures,
> > that should be fixed by this series.
> >
> > Stefano Brivio (6):
> > tcp: FIN flags have to be retransmitted as well
> > tcp: Factor sequence rewind for retransmissions into a new function
> > tcp: Rewind sequence when guest shrinks window to zero
> > tcp: Fix closing logic for half-closed connections
> > tcp: Don't try to transmit right after the peer shrank the window to
> > zero
> > tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path
> > consistent
> >
> > tcp.c | 170 ++++++++++++++++++++++++++++++++++++++++------------------
> > 1 file changed, 118 insertions(+), 52 deletions(-)
> I applied the series in my test VM and run the reproducer from the
> podman tests where I noticed the flake.
>
> It failed after about 10mins and then when I enabled pasta trace logs
> and tcpdump captures then it took 208 mins for me to reproduce so it
> doesn't seem to fix the issue I am seeing.
>
> I did send the pasta log/pcap files in private to Stefano for further
> debugging.
Here's a quick summary of the issue by the way, if somebody is curious.
Excerpt of packet capture and pasta.log attached as well. Commented
capture:
$ tshark -r pasta-container.pcap -td -Y 'frame.number in { 345 .. 369 }'
345 0.000018 192.168.122.100 → 169.254.1.2 60 TCP 5978 → 51626
[ACK] Seq=1 Ack=10086371 Win=140288 Len=0 346 0.000014 169.254.1.2 → 192.168.122.100 65540 TCP 51626 → 5978 [PSH, ACK] Seq=10086371 Ack=1 Win=65536 Len=65480
347 0.000024 169.254.1.2 → 192.168.122.100 740 TCP 51626 → 5978 [PSH, ACK] Seq=10151851 Ack=1 Win=65536 Len=680
348 0.000031 169.254.1.2 → 192.168.122.100 65540 TCP 51626 → 5978 [PSH, ACK] Seq=10152531 Ack=1 Win=65536 Len=65480
349 0.000032 169.254.1.2 → 192.168.122.100 8708 TCP [TCP Window Full] 51626 → 5978 [PSH, ACK] Seq=10218011 Ack=1 Win=65536 Len=8648
350 0.000005 192.168.122.100 → 169.254.1.2 60 TCP 5978 → 51626 [ACK] Seq=1 Ack=10218011 Win=8704 Len=0
351 0.000128 192.168.122.100 → 169.254.1.2 60 TCP 5978 → 51626 [ACK] Seq=1 Ack=10226659 Win=65536 Len=0
352 0.000051 169.254.1.2 → 192.168.122.100 116 TCP 51626 → 5978 [PSH, ACK] Seq=10226659 Ack=1 Win=65536 Len=56
353 0.000038 169.254.1.2 → 192.168.122.100 65540 TCP [TCP Window Full] 51626 → 5978 [PSH, ACK] Seq=10226715 Ack=1 Win=65536 Len=65480
354 0.000021 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10292195 Win=0 Len=0
355 0.000122 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5978 → 51626 [ACK] Seq=1 Ack=10292195 Win=110592 Len=0
356 0.000067 169.254.1.2 → 192.168.122.100 65540 TCP 51626 → 5978 [PSH, ACK] Seq=10292195 Ack=1 Win=65536 Len=65480
357 0.000031 169.254.1.2 → 192.168.122.100 45172 TCP [TCP Window Full] 51626 → 5978 [PSH, ACK] Seq=10357675 Ack=1 Win=65536 Len=45112
358 0.000038 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10357675 Win=0 Len=0
359 0.000023 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5978 → 51626 [ACK] Seq=1 Ack=10357675 Win=69632 Len=0
360 0.000037 169.254.1.2 → 192.168.122.100 65540 TCP [TCP Out-Of-Order] 51626 → 5978 [PSH, ACK] Seq=10357675 Ack=1 Win=65536 Len=65480
...so far so good. What tshark marks as out-of-order is a result of the
kernel shrinking the window and us retransmitting from there (that's patch
3/6). Then this happens:
--
361 0.000018 169.254.1.2 → 192.168.122.100 4212 TCP [TCP Window Full] 51626 → 5978 [PSH, ACK] Seq=10423155 Ack=1 Win=65536 Len=4152
- we send some data
--
362 0.000036 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=0 Len=0
- the kernel shrinks the window: according to frame #359, we could send up
to sequence byte 10357675 + 69632 = 10427307, and so we did that in frame
#361 (10423155 + 4152 = 10427307), but now that's "too much" (this looks
pathological and maybe we should have a second / third / fourth attempt
at trying to fix the kernel...)
--
363 0.000066 169.254.1.2 → 192.168.122.100 62666 TCP [TCP Out-Of-Order] 51626 → 5978 [PSH, ACK] Seq=10423155 Ack=1 Win=65536 Len=62606
- we have more data from the socket, corresponding pasta.log excerpt:
0.8967: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8967: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8968: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS dropped
0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8968: Flow 0 (TCP connection): timer expires in 2.000s
and we decide to send some, out of window. This is intended. However:
--
364 0.000006 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=0 Len=0
...the kernel didn't reopen the window yet
--
365 0.000058 169.254.1.2 → 192.168.122.100 60 TCP 51626 → 5978 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0
but we send a FIN with an updated sequence (this looks wrong):
0.8968: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8968: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8968: Flow 0 (TCP connection): flag at tcp_prepare_flags:1195
0.8968: Flow 0 (TCP connection): flag at tcp_prepare_flags:1201
0.8968: Flow 0 (TCP connection): timer expires in 2.000s
0.8968: Flow 0 (TCP connection): event at tcp_buf_data_from_sock:379
0.8968: Flow 0 (TCP connection): TAP_FIN_SENT: CLOSE_WAIT -> LAST_ACK
--
366 0.000003 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=0 Len=0
same as before...
--
367 0.000021 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=73216 Len=0
368 0.000119 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=155136 Len=0
and finally we have window updates, but we're in LAST_ACK and for whatever
reason we never send the missing data. Things time out after two minutes.
--
I guess I missed to rewind the sequence in some cases where the window
shrinks, or something else to that effect.
--
Stefano
[-- Attachment #2: pasta-container-end.pcap.gz --]
[-- Type: application/gzip, Size: 128892 bytes --]
[-- Attachment #3: pasta-end.log --]
[-- Type: text/x-log, Size: 13804 bytes --]
0.8960: pasta: epoll event on connected TCP socket 81 (events: 0x00000001)
0.8960: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8960: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8960: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8960: Flow 0 (TCP connection): timer expires in 2.000s
0.8961: pasta: epoll event on connected TCP socket 81 (events: 0x00000001)
0.8961: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8961: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8961: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8961: Flow 0 (TCP connection): timer expires in 2.000s
0.8961: pasta: epoll event on connected TCP socket 81 (events: 0x00000001)
0.8961: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:314
0.8961: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS
0.8961: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:315
0.8961: Flow 0 (TCP connection): STALLED
0.8961: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:316
0.8961: Flow 0 (TCP connection): timer expires in 2.000s
0.8961: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8961: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8961: Flow 0 (TCP connection): packet length 20 from tap
0.8962: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1093
0.8962: Flow 0 (TCP connection): timer expires in 2.000s
0.8962: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8962: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8962: Flow 0 (TCP connection): STALLED dropped
0.8962: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8962: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8962: Flow 0 (TCP connection): SOCK_FIN_RCVD: ESTABLISHED -> CLOSE_WAIT
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8963: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS dropped
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8963: Flow 0 (TCP connection): timer expires in 2.000s
0.8963: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8963: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8963: Flow 0 (TCP connection): packet length 20 from tap
0.8963: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1093
0.8963: Flow 0 (TCP connection): timer expires in 2.000s
0.8963: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8963: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8963: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8963: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8963: Flow 0 (TCP connection): timer expires in 2.000s
0.8963: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8963: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:314
0.8963: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:315
0.8963: Flow 0 (TCP connection): STALLED
0.8963: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:316
0.8963: Flow 0 (TCP connection): timer expires in 2.000s
0.8963: pasta: epoll event on connected TCP socket 81 (events: 0x00002000)
0.8963: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8963: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8963: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8963: Flow 0 (TCP connection): packet length 20 from tap
0.8963: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1088
0.8963: Flow 0 (TCP connection): ACK_FROM_TAP_DUE dropped
0.8963: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8963: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8964: Flow 0 (TCP connection): STALLED dropped
0.8964: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8964: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8964: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:314
0.8964: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:315
0.8964: Flow 0 (TCP connection): STALLED
0.8964: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:316
0.8964: Flow 0 (TCP connection): ACK_FROM_TAP_DUE
0.8964: Flow 0 (TCP connection): timer expires in 2.000s
0.8964: pasta: epoll event on connected TCP socket 81 (events: 0x00002000)
0.8964: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8965: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8965: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8965: Flow 0 (TCP connection): packet length 20 from tap
0.8965: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1088
0.8965: Flow 0 (TCP connection): ACK_FROM_TAP_DUE dropped
0.8965: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8965: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8965: Flow 0 (TCP connection): STALLED dropped
0.8965: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8965: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8965: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8965: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS dropped
0.8965: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8965: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8965: Flow 0 (TCP connection): ACK_FROM_TAP_DUE
0.8965: Flow 0 (TCP connection): timer expires in 2.000s
0.8965: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8965: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8965: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8965: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8965: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8965: Flow 0 (TCP connection): timer expires in 2.000s
0.8966: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8966: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8966: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:314
0.8966: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS
0.8966: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:315
0.8966: Flow 0 (TCP connection): STALLED
0.8966: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:316
0.8966: Flow 0 (TCP connection): timer expires in 2.000s
0.8966: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8966: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8966: Flow 0 (TCP connection): packet length 20 from tap
0.8966: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1093
0.8966: Flow 0 (TCP connection): timer expires in 2.000s
0.8966: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8966: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8966: Flow 0 (TCP connection): STALLED dropped
0.8966: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8966: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8966: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8966: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS dropped
0.8966: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8966: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8966: Flow 0 (TCP connection): timer expires in 2.000s
0.8966: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8966: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8966: Flow 0 (TCP connection): packet length 20 from tap
0.8966: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8966: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8967: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8967: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8967: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8967: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8967: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8967: Flow 0 (TCP connection): timer expires in 2.000s
0.8967: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8967: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8967: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:314
0.8967: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS
0.8967: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:315
0.8967: Flow 0 (TCP connection): STALLED
0.8967: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:316
0.8967: Flow 0 (TCP connection): timer expires in 2.000s
0.8967: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8967: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8967: Flow 0 (TCP connection): packet length 20 from tap
0.8967: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1093
0.8967: Flow 0 (TCP connection): timer expires in 2.000s
0.8967: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8967: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8967: Flow 0 (TCP connection): STALLED dropped
0.8967: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8967: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393
0.8968: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS dropped
0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394
0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418
0.8968: Flow 0 (TCP connection): timer expires in 2.000s
0.8968: pasta: epoll event on connected TCP socket 81 (events: 0x00002001)
0.8968: Flow 0 (TCP connection): event at tcp_sock_handler:2416
0.8968: Flow 0 (TCP connection): flag at tcp_prepare_flags:1195
0.8968: Flow 0 (TCP connection): flag at tcp_prepare_flags:1201
0.8968: Flow 0 (TCP connection): timer expires in 2.000s
0.8968: Flow 0 (TCP connection): event at tcp_buf_data_from_sock:379
0.8968: Flow 0 (TCP connection): TAP_FIN_SENT: CLOSE_WAIT -> LAST_ACK
0.8968: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8968: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8968: Flow 0 (TCP connection): packet length 20 from tap
0.8968: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8968: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8968: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8969: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8969: Flow 0 (TCP connection): packet length 20 from tap
0.8969: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1088
0.8969: Flow 0 (TCP connection): ACK_FROM_TAP_DUE dropped
0.8969: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8969: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8969: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8969: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8969: Flow 0 (TCP connection): packet length 20 from tap
0.8969: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1088
0.8969: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8969: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.8970: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
0.8970: tap: protocol 6, 192.168.122.100:5978 -> 169.254.1.2:51626 (1 packet)
0.8970: Flow 0 (TCP connection): packet length 20 from tap
0.8970: Flow 0 (TCP connection): flag at tcp_update_seqack_from_tap:1088
0.8970: Flow 0 (TCP connection): flag at tcp_prepare_flags:1143
0.8970: Flow 0 (TCP connection): flag at tcp_tap_handler:2136
0.9880: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
2.0120: pasta: epoll event on /dev/net/tun device 9 (events: 0x00000001)
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH 5/6] tcp: Don't try to transmit right after the peer shrank the window to zero
2025-08-18 17:06 ` Stefano Brivio
@ 2025-08-19 0:55 ` David Gibson
0 siblings, 0 replies; 17+ messages in thread
From: David Gibson @ 2025-08-19 0:55 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Jon Maloy, Paul Holzinger
[-- Attachment #1: Type: text/plain, Size: 3082 bytes --]
On Mon, Aug 18, 2025 at 07:06:29PM +0200, Stefano Brivio wrote:
> On Mon, 18 Aug 2025 12:56:09 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Fri, Aug 15, 2025 at 06:10:41PM +0200, Stefano Brivio wrote:
> > > If the peer shrinks the window to zero, we'll skip storing the new
> > > window, as a convenient way to cause window probes (which exceed any
> > > zero-sized window, strictly speaking) if we don't get window updates
> > > in a while.
> >
> > Strictly speaking, not storing the new zero window feels slightly
> > wrong to me - I wonder if it would be more correct to store the zero
> > window, but still send window probes as a special case.
>
> Yeah, it's wrong but it's not really obvious what value we should use
> at that point. With a recent kernel version, there would be no problem
> storing 0, waiting for a window update which will come soon after, and
> reserve the special case I already added (1 as window size) for
> zero-window probes, triggered via timer.
>
> But without 8c670bdfa58e ("tcp: correct handling of extreme memory
> squeeze"), and with e2142825c120 ("net: tcp: send zero-window ACK when
> no memory"), the window update never comes, so we would need to wait
> for the timer to expire before we send an explicit zero-window probe,
> which takes a while (two seconds).
Right. I guess what I had in mind was to store the zero window, but
also set up a new timer to send a new window-probe in a more
reasonable timeframe.
> With this trick, instead, we keep the latest advertised value, which
> is actually the latest correct value, and retry sending what we were
> originally sending as soon as we have more data from the socket, which
> would becomes available quite soon in all the practical cases where the
> kernel shrinks the window to zero.
Ah, right. So, IIUC a part of the issue is that at this point if we
see a zero window we don't know if it's because *we* filled the window
in the normal way - and so should wait for an update - or because the
peer slammed the window shut - in which case we need to reprobe. I
mean, I guess we could set a flag for that earlier, but that's fairly
awkward.
At least for the time being, I think the current approach of keeping
the previous window is preferable.
> Maybe we could store zero, and use WINDOW_DEFAULT (14600 bytes) on
> socket activity. The RFC simply says we need to send zero-window
> probes at some point, not that we should assume a broken receiver,
> but unfortunately between e2142825c120 and 8c670bdfa58e it's like that.
I think the existing behaviour of keeping the previous window is
preferable. At least for the specific case of a transient memory
error, that seems more likely to be right. Maybe? Unless we think we
should slow down in the hopes of not exhausting the peer's memory again.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-08-19 0:55 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-15 16:10 [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
2025-08-15 16:10 ` [PATCH 1/6] tcp: FIN flags have to be retransmitted as well Stefano Brivio
2025-08-18 2:43 ` David Gibson
2025-08-15 16:10 ` [PATCH 2/6] tcp: Factor sequence rewind for retransmissions into a new function Stefano Brivio
2025-08-18 2:46 ` David Gibson
2025-08-15 16:10 ` [PATCH 3/6] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
2025-08-18 2:49 ` David Gibson
2025-08-15 16:10 ` [PATCH 4/6] tcp: Fix closing logic for half-closed connections Stefano Brivio
2025-08-18 2:52 ` David Gibson
2025-08-15 16:10 ` [PATCH 5/6] tcp: Don't try to transmit right after the peer shrank the window to zero Stefano Brivio
2025-08-18 2:56 ` David Gibson
2025-08-18 17:06 ` Stefano Brivio
2025-08-19 0:55 ` David Gibson
2025-08-15 16:10 ` [PATCH 6/6] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
2025-08-18 3:04 ` David Gibson
2025-08-18 17:40 ` [PATCH 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Paul Holzinger
2025-08-18 21:58 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).