[net, v2] tcp: correct handling of extreme memory squeeze

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [net, v2] tcp: correct handling of extreme memory squeeze
@ 2025-01-16  2:29 Jon Maloy
  2025-01-16 21:14 ` Stefano Brivio
  0 siblings, 1 reply; 20+ messages in thread
From: Jon Maloy @ 2025-01-16  2:29 UTC (permalink / raw)
  To: passt-dev, sbrivio, lvivier, dgibson, jmaloy

Testing with iperf3 using the "pasta" protocol splicer has revealed
a bug in the way tcp handles window advertising in extreme memory
squeeze situations. The problem occurs on the server side, and
the socket in question is a completely regular socket with the
default settings for the fedora40 kernel. We do not use SO_PEEK
or SO_RCV_BUF on this socket.

The following excerpt of a logging session, with own comments added,
shows what is happening:

//              tcp_v4_rcv(<-)
//                tcp_rcv_established(<-)
[5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]:     tcp_data_queue(->)
[5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
                       [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
                       [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
[5201<->39222]:     tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
[5201<->39222]:     __tcp_transmit_skb(->)
[5201<->39222]:       tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->39222]:       tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
[5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]:     __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:   tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)

// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
                [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]

// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
                [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]

// The above pattern repeats again and again, since nothing changes
// between the reads.

[...]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
                [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
                [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]

// The receive queue is empty, but no new advertisement is sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.

Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.

We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
has been dropped because of memory exhaustion and we have to advertize
a zero window.

Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Signed-off-by: Jon Maloy <jmaloy@redhat.com>

---
v1: -Posted on Apr 6, 2024
v2: -Improved commit log to clarify how we end up in this situation.
    -After feedback from Eric Dumazet, removed references to use of
     SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
     about how this situation occurs. Those flags are used at the
     peer side's incoming connection, and not on this one.
---
 net/ipv4/tcp_output.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..ba295f798e5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [net, v2] tcp: correct handling of extreme memory squeeze
  2025-01-16  2:29 [net, v2] tcp: correct handling of extreme memory squeeze Jon Maloy
@ 2025-01-16 21:14 ` Stefano Brivio
  0 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-16 21:14 UTC (permalink / raw)
  To: Jon Maloy; +Cc: passt-dev, lvivier, dgibson

On Wed, 15 Jan 2025 21:29:18 -0500
Jon Maloy <jmaloy@redhat.com> wrote:

> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations. The problem occurs on the server side, and
> the socket in question is a completely regular socket with the
> default settings for the fedora40 kernel. We do not use SO_PEEK
> or SO_RCV_BUF on this socket.

SO_RCVBUF

I think what's missing, at this point, is a short description of the
issue.

That is (if I recall correctly), under memory pressure, we temporarily
advertise a zero-sized window, but this is not stored as part of socket
data, the reasoning being that it's a temporary setting which shouldn't
influence any further calculation.

However, by not storing this information in socket data, we also fail
to advertise a non-zero window once we have enough memory to receive
data.

Our notion of the window size is now different from what we
advertised to the peer. The peer won't send any data as a result, but
we won't recalculate the window, either, as long as we don't receive
any.

The rest looks good and very reasonable to me.

-- 
Stefano

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [net,v2] tcp: correct handling of extreme memory squeeze
@ 2025-01-17 21:40 jmaloy
  2025-01-17 22:09 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: jmaloy @ 2025-01-17 21:40 UTC (permalink / raw)
  To: netdev, davem
  Cc: kuba, passt-dev, jmaloy, sbrivio, lvivier, dgibson, imagedong,
	eric.dumazet, edumazet

From: Jon Maloy <jmaloy@redhat.com>

Testing with iperf3 using the "pasta" protocol splicer has revealed
a bug in the way tcp handles window advertising in extreme memory
squeeze situations.

Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.

However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.

The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.

The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:

//              tcp_v4_rcv(->)
//                tcp_rcv_established(->)
[5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]:     tcp_data_queue(->)
[5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
                       [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                       [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
[5201<->39222]:     tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
[5201<->39222]:     __tcp_transmit_skb(->)
[5201<->39222]:       tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->39222]:       tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
[5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]:     __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:   tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)

// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]

// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]

// The above pattern repeats again and again, since nothing changes
// between the reads.

[...]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
[5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]

// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.

Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.

We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
has been dropped because of memory exhaustion and we have to advertize
a zero window.

Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v1: -Posted on Apr 6, 2024
v2: -Improved commit log to clarify how we end up in this situation.
    -After feedback from Eric Dumazet, removed references to use of
     SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
     about how this situation occurs. Those flags are used at the
     peer side's incoming connection, and not on this one.
---
 net/ipv4/tcp_output.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..ba295f798e5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-17 21:40 [net,v2] " jmaloy
@ 2025-01-17 22:09 ` Eric Dumazet
  2025-01-17 22:27   ` Stefano Brivio
  2025-01-18 17:01 ` Jason Xing
  2025-01-18 20:04 ` Neal Cardwell
  2 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2025-01-17 22:09 UTC (permalink / raw)
  To: jmaloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	imagedong, eric.dumazet

On Fri, Jan 17, 2025 at 10:40 PM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> [5201<->39222]:     tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
> [5201<->39222]:     __tcp_transmit_skb(->)
> [5201<->39222]:       tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->39222]:       tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]:     __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:   tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
>
> Furthermore, we have observed that in these situations this side may
> send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
> as it should be. Backing ack_seq seems to be harmless, but is of
> course still wrong from a protocol viewpoint.
>
> We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
> has been dropped because of memory exhaustion and we have to advertize
> a zero window.
>
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
>
> Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
> v1: -Posted on Apr 6, 2024

Could you post the link, this was a long time ago and I forgot the context.

> v2: -Improved commit log to clarify how we end up in this situation.
>     -After feedback from Eric Dumazet, removed references to use of
>      SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
>      about how this situation occurs. Those flags are used at the
>      peer side's incoming connection, and not on this one.
> ---
>  net/ipv4/tcp_output.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..ba295f798e5e 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
>         u32 cur_win, new_win;
>
>         /* Make the window 0 if we failed to queue the data because we
> -        * are out of memory. The window is temporary, so we don't store
> -        * it on the socket.
> +        * are out of memory.
>          */
> -       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
> +       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
> +               tp->rcv_wnd = 0;
> +               tp->rcv_wup = tp->rcv_nxt;

I wonder if we should not clear tp->pred_flags here ?

Also, any chance you could provide a packetdrill test ?

Your changelog contains traces that are hard to follow.

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-17 22:09 ` Eric Dumazet
@ 2025-01-17 22:27   ` Stefano Brivio
  0 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-17 22:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: jmaloy, netdev, davem, kuba, passt-dev, lvivier, dgibson,
	Menglong Dong, eric.dumazet

[Fixed Cc: for Menglong Dong, this is a reply to:
 https://lore.kernel.org/all/CANn89i+Ks52JVTBsMFQBM4CqUR4cegXhbSCH77aMCqFpd-S_1A@mail.gmail.com/]

On Fri, 17 Jan 2025 23:09:27 +0100
Eric Dumazet <edumazet@google.com> wrote:

> On Fri, Jan 17, 2025 at 10:40 PM <jmaloy@redhat.com> wrote:
>
> > v1: -Posted on Apr 6, 2024  
> 
> Could you post the link, this was a long time ago and I forgot the context.

https://lore.kernel.org/all/20240406182107.261472-1-jmaloy@redhat.com/#r

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-17 21:40 [net,v2] " jmaloy
  2025-01-17 22:09 ` Eric Dumazet
@ 2025-01-18 17:01 ` Jason Xing
  2025-01-18 20:04 ` Neal Cardwell
  2 siblings, 0 replies; 20+ messages in thread
From: Jason Xing @ 2025-01-18 17:01 UTC (permalink / raw)
  To: jmaloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	imagedong, eric.dumazet, edumazet

On Sat, Jan 18, 2025 at 5:41 AM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> [5201<->39222]:     tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
> [5201<->39222]:     __tcp_transmit_skb(->)
> [5201<->39222]:       tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->39222]:       tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]:     __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:   tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
> [5201<->39222]:   __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                 [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
>
> Furthermore, we have observed that in these situations this side may
> send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
> as it should be. Backing ack_seq seems to be harmless, but is of
> course still wrong from a protocol viewpoint.
>
> We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
> has been dropped because of memory exhaustion and we have to advertize
> a zero window.
>
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.

Thanks for such a detailed explanation :) However, undeniably, I've
read several times until I totally follow the issue you reported. I
wonder if you can compact and simplify the commit message a little
bit, say, remove the last two calltraces that can be replaced by a
single sentence like you've already stated (which is "the queue is
empty").

Let me rephrase the scenario to see if I'm understanding in the right way:
1. First time the receiver sends a skb with zero window because of
shortage of memory.
2. Sometimes the user calls the tcp_recvmsg() to remove the data from
kernel to user space, sending the ACK process should be triggered,
especially when the queue is empty.
3. Then the client reading the zero window signal stops sending data.

#1 fails to update the rcv_wup and rcv_wnd, so since then in the
manner of __tcp_cleanup_rbuf(), the rcv_window_now variable is always
the same in this case (please see the calculation in
tcp_receive_window()). rcv_window_now, which I presume is the same as
win_now in the above description, should vary when
__tcp_cleanup_rbuf() is called. Because of that, #2 could never send
an ACK.

My small question is why is the new_window always 262144 no matter
what amount of data the user receives? Is it because in
__tcp_select_window() there is no wscale option?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-17 21:40 [net,v2] " jmaloy
  2025-01-17 22:09 ` Eric Dumazet
  2025-01-18 17:01 ` Jason Xing
@ 2025-01-18 20:04 ` Neal Cardwell
  2025-01-20  5:03   ` Jon Maloy
  2 siblings, 1 reply; 20+ messages in thread
From: Neal Cardwell @ 2025-01-18 20:04 UTC (permalink / raw)
  To: jmaloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	imagedong, eric.dumazet, edumazet

On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.

The "if we happen to stall at an unfortunate value of the current
window size" phrase is a little vague... :-) Do you have a sense of
what might count as "unfortunate" here? That might help in crafting a
packetdrill test to reproduce this and have an automated regression
test.

> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.

Since the peer last saw a zero receive window at the time of the
memory-pressure drop, shouldn't the peer be sending repeated zero
window probes, and shouldn't the local host respond to a ZWP with an
ACK with the correct non-zero window?

Do you happen to have a tcpdump .pcap of one of these cases that you can share?

> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]

What is "win_now"? That doesn't seem to correspond to any variable
name in the Linux source tree.  Can this be renamed to the
tcp_select_window() variable it is printing, like "cur_win" or
"effective_win" or "new_win", etc?

Or perhaps you can attach your debugging patch in some email thread? I
agree with Eric that these debug dumps are a little hard to parse
without seeing the patch that allows us to understand what some of
these fields are...

I agree with Eric that probably tp->pred_flags should be cleared, and
a packetdrill test for this would be super-helpful.

thanks,
neal

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-18 20:04 ` Neal Cardwell
@ 2025-01-20  5:03   ` Jon Maloy
  2025-01-20 16:10     ` Jon Maloy
  0 siblings, 1 reply; 20+ messages in thread
From: Jon Maloy @ 2025-01-20  5:03 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	imagedong, eric.dumazet, edumazet



On 2025-01-18 15:04, Neal Cardwell wrote:
> On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
>>
>> From: Jon Maloy <jmaloy@redhat.com>
>>
>> Testing with iperf3 using the "pasta" protocol splicer has revealed
>> a bug in the way tcp handles window advertising in extreme memory
>> squeeze situations.
>>
>> Under memory pressure, a socket endpoint may temporarily advertise
>> a zero-sized window, but this is not stored as part of the socket data.
>> The reasoning behind this is that it is considered a temporary setting
>> which shouldn't influence any further calculations.
>>
>> However, if we happen to stall at an unfortunate value of the current
>> window size, the algorithm selecting a new value will consistently fail
>> to advertise a non-zero window once we have freed up enough memory.
> 
> The "if we happen to stall at an unfortunate value of the current
> window size" phrase is a little vague... :-) Do you have a sense of
> what might count as "unfortunate" here? That might help in crafting a
> packetdrill test to reproduce this and have an automated regression
> test.

Obviously, it happens when the following code snippet in

__tcp_cleanup_rbuf() {
    [....]
    if (copied > 0 && !time_to_ack &&
        !(sk->sk_shutdown & RCV_SHUTDOWN)) {
             __u32 rcv_window_now = tcp_receive_window(tp);

             /* Optimize, __tcp_select_window() is not cheap. */
             if (2*rcv_window_now <= tp->window_clamp) {
                 __u32 new_window = __tcp_select_window(sk);

                 /* Send ACK now, if this read freed lots of space
                  * in our buffer. Certainly, new_window is new window.
                  * We can advertise it now, if it is not less than 

                  * current one.
                  * "Lots" means "at least twice" here.
                  */
                 if (new_window && new_window >= 2 * rcv_window_now)
                         time_to_ack = true;
            }
     }
     [....]
}

yields time_to_ack = false, i.e.  __tcp_select_window(sk) returns
a value new_window  < (2 *  tcp_receive_window(tp)).

In my log I have for brevity used the following names:

win_now: same as rcv_window_now
     (= tcp_receive_window(tp),
      = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
      = 265469200 + 262144 -  265600160,
      = 131184)

new_win: same as new_window
      (= __tcp_select_window(sk),
       = 0 first time, later 262144 )

rcv_wnd: same as tp->rcv_wnd,
       (=262144)

We see that although the last test actually is pretty close
(262144 >= 262368 ? => false) it is not close enough.


We also notice that
(tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
(new_window >= 2 * rcv_window_now) will always be false.


Too me it looks like __tcp_select_window(sk) doesn't at all take the 
freed-up memory into account when calculating a new window. I haven't
looked into why that is happening.

> 
>> This means that this side's notion of the current window size is
>> different from the one last advertised to the peer, causing the latter
>> to not send any data to resolve the sitution.
> 
> Since the peer last saw a zero receive window at the time of the
> memory-pressure drop, shouldn't the peer be sending repeated zero
> window probes, and shouldn't the local host respond to a ZWP with an
> ACK with the correct non-zero window?

It should, but at the moment when I found this bug the peer stack was 
not the Linux kernel stack, but one we develop for our own purpose. We 
fixed that later, but it still means that traffic stops for a couple of 
seconds now and then before the timer restarts the flow. This happens
too often for comfort in our usage scenarios.
We can of course blame the the peer stack, but I still feel this is a
bug, and that it could be handled better by the kernel stack.
> 
> Do you happen to have a tcpdump .pcap of one of these cases that you can share?

I had one, although not for this particular run, and I cannot find it 
right now. I will continue looking or make a new one. Is there some 
shared space I can put it?

> 
>> The problem occurs on the iperf3 server side, and the socket in question
>> is a completely regular socket with the default settings for the
>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>>
>> The following excerpt of a logging session, with own comments added,
>> shows more in detail what is happening:
>>
>> //              tcp_v4_rcv(->)
>> //                tcp_rcv_established(->)
>> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
>> [5201<->39222]:     tcp_data_queue(->)
>> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>>                         [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> 
> What is "win_now"? That doesn't seem to correspond to any variable
> name in the Linux source tree. 

See above.

  Can this be renamed to the
> tcp_select_window() variable it is printing, like "cur_win" or
> "effective_win" or "new_win", etc?
> 
> Or perhaps you can attach your debugging patch in some email thread? I
> agree with Eric that these debug dumps are a little hard to parse
> without seeing the patch that allows us to understand what some of
> these fields are...
> 
> I agree with Eric that probably tp->pred_flags should be cleared, and
> a packetdrill test for this would be super-helpful.

I must admit I have never used packetdrill, but I can make an effort.

///jon

> 
> thanks,
> neal
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-20  5:03   ` Jon Maloy
@ 2025-01-20 16:10     ` Jon Maloy
  2025-01-20 16:22       ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Jon Maloy @ 2025-01-20 16:10 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	eric.dumazet, edumazet, Menglong Dong



On 2025-01-20 00:03, Jon Maloy wrote:
> 
> 
> On 2025-01-18 15:04, Neal Cardwell wrote:
>> On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
>>>
>>> From: Jon Maloy <jmaloy@redhat.com>
>>>
>>> Testing with iperf3 using the "pasta" protocol splicer has revealed
>>> a bug in the way tcp handles window advertising in extreme memory
>>> squeeze situations.
>>>
>>> Under memory pressure, a socket endpoint may temporarily advertise
>>> a zero-sized window, but this is not stored as part of the socket data.
>>> The reasoning behind this is that it is considered a temporary setting
>>> which shouldn't influence any further calculations.
>>>
>>> However, if we happen to stall at an unfortunate value of the current
>>> window size, the algorithm selecting a new value will consistently fail
>>> to advertise a non-zero window once we have freed up enough memory.
>>
>> The "if we happen to stall at an unfortunate value of the current
>> window size" phrase is a little vague... :-) Do you have a sense of
>> what might count as "unfortunate" here? That might help in crafting a
>> packetdrill test to reproduce this and have an automated regression
>> test.
> 
> Obviously, it happens when the following code snippet in
> 
> __tcp_cleanup_rbuf() {
>     [....]
>     if (copied > 0 && !time_to_ack &&
>         !(sk->sk_shutdown & RCV_SHUTDOWN)) {
>              __u32 rcv_window_now = tcp_receive_window(tp);
> 
>              /* Optimize, __tcp_select_window() is not cheap. */
>              if (2*rcv_window_now <= tp->window_clamp) {
>                  __u32 new_window = __tcp_select_window(sk);
> 
>                  /* Send ACK now, if this read freed lots of space
>                   * in our buffer. Certainly, new_window is new window.
>                   * We can advertise it now, if it is not less than
>                   * current one.
>                   * "Lots" means "at least twice" here.
>                   */
>                  if (new_window && new_window >= 2 * rcv_window_now)
>                          time_to_ack = true;
>             }
>      }
>      [....]
> }
> 
> yields time_to_ack = false, i.e.  __tcp_select_window(sk) returns
> a value new_window  < (2 *  tcp_receive_window(tp)).
> 
> In my log I have for brevity used the following names:
> 
> win_now: same as rcv_window_now
>      (= tcp_receive_window(tp),
>       = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
>       = 265469200 + 262144 -  265600160,
>       = 131184)
> 
> new_win: same as new_window
>       (= __tcp_select_window(sk),
>        = 0 first time, later 262144 )
> 
> rcv_wnd: same as tp->rcv_wnd,
>        (=262144)
> 
> We see that although the last test actually is pretty close
> (262144 >= 262368 ? => false) it is not close enough.
> 
> 
> We also notice that
> (tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
> 130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
> (new_window >= 2 * rcv_window_now) will always be false.
> 
> 
> Too me it looks like __tcp_select_window(sk) doesn't at all take the 
> freed-up memory into account when calculating a new window. I haven't
> looked into why that is happening.
> 
>>
>>> This means that this side's notion of the current window size is
>>> different from the one last advertised to the peer, causing the latter
>>> to not send any data to resolve the sitution.
>>
>> Since the peer last saw a zero receive window at the time of the
>> memory-pressure drop, shouldn't the peer be sending repeated zero
>> window probes, and shouldn't the local host respond to a ZWP with an
>> ACK with the correct non-zero window?
> 
> It should, but at the moment when I found this bug the peer stack was 
> not the Linux kernel stack, but one we develop for our own purpose. We 
> fixed that later, but it still means that traffic stops for a couple of 
> seconds now and then before the timer restarts the flow. This happens
> too often for comfort in our usage scenarios.
> We can of course blame the the peer stack, but I still feel this is a
> bug, and that it could be handled better by the kernel stack.
>>
>> Do you happen to have a tcpdump .pcap of one of these cases that you 
>> can share?
> 
> I had one, although not for this particular run, and I cannot find it 
> right now. I will continue looking or make a new one. Is there some 
> shared space I can put it?

Here it is. Look at frame #1067.

https://passt.top/static/iperf3_jon_zero_window_cut.pcap
> 
>>
>>> The problem occurs on the iperf3 server side, and the socket in question
>>> is a completely regular socket with the default settings for the
>>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>>>
>>> The following excerpt of a logging session, with own comments added,
>>> shows more in detail what is happening:
>>>
>>> //              tcp_v4_rcv(->)
>>> //                tcp_rcv_established(->)
>>> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/ 
>>> tcp_data_queue()/5257 ====
>>> [5201<->39222]:     tcp_data_queue(->)
>>> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: 
>>> SKB_DROP_REASON_PROTO_MEM
>>>                         [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 
>>> 265469200, win_now 131184]
>>
>> What is "win_now"? That doesn't seem to correspond to any variable
>> name in the Linux source tree. 
> 
> See above.
> 
>   Can this be renamed to the
>> tcp_select_window() variable it is printing, like "cur_win" or
>> "effective_win" or "new_win", etc?
>>
>> Or perhaps you can attach your debugging patch in some email thread? I
>> agree with Eric that these debug dumps are a little hard to parse
>> without seeing the patch that allows us to understand what some of
>> these fields are...
>>
>> I agree with Eric that probably tp->pred_flags should be cleared, and
>> a packetdrill test for this would be super-helpful.
> 
> I must admit I have never used packetdrill, but I can make an effort.

I hear from other sources that you cannot force a memory exhaustion with
packetdrill anyway, so this sounds like a pointless exercise.

///jon

> 
> ///jon
> 
>>
>> thanks,
>> neal
>>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-20 16:10     ` Jon Maloy
@ 2025-01-20 16:22       ` Eric Dumazet
  2025-01-24 17:40         ` Jon Maloy
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2025-01-20 16:22 UTC (permalink / raw)
  To: Jon Maloy
  Cc: Neal Cardwell, netdev, davem, kuba, passt-dev, sbrivio, lvivier,
	dgibson, eric.dumazet, Menglong Dong

On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy@redhat.com> wrote:
>
>
>
> On 2025-01-20 00:03, Jon Maloy wrote:
> >
> >
> > On 2025-01-18 15:04, Neal Cardwell wrote:
> >> On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
> >>>
> >>> From: Jon Maloy <jmaloy@redhat.com>
> >>>
> >>> Testing with iperf3 using the "pasta" protocol splicer has revealed
> >>> a bug in the way tcp handles window advertising in extreme memory
> >>> squeeze situations.
> >>>
> >>> Under memory pressure, a socket endpoint may temporarily advertise
> >>> a zero-sized window, but this is not stored as part of the socket data.
> >>> The reasoning behind this is that it is considered a temporary setting
> >>> which shouldn't influence any further calculations.
> >>>
> >>> However, if we happen to stall at an unfortunate value of the current
> >>> window size, the algorithm selecting a new value will consistently fail
> >>> to advertise a non-zero window once we have freed up enough memory.
> >>
> >> The "if we happen to stall at an unfortunate value of the current
> >> window size" phrase is a little vague... :-) Do you have a sense of
> >> what might count as "unfortunate" here? That might help in crafting a
> >> packetdrill test to reproduce this and have an automated regression
> >> test.
> >
> > Obviously, it happens when the following code snippet in
> >
> > __tcp_cleanup_rbuf() {
> >     [....]
> >     if (copied > 0 && !time_to_ack &&
> >         !(sk->sk_shutdown & RCV_SHUTDOWN)) {
> >              __u32 rcv_window_now = tcp_receive_window(tp);
> >
> >              /* Optimize, __tcp_select_window() is not cheap. */
> >              if (2*rcv_window_now <= tp->window_clamp) {
> >                  __u32 new_window = __tcp_select_window(sk);
> >
> >                  /* Send ACK now, if this read freed lots of space
> >                   * in our buffer. Certainly, new_window is new window.
> >                   * We can advertise it now, if it is not less than
> >                   * current one.
> >                   * "Lots" means "at least twice" here.
> >                   */
> >                  if (new_window && new_window >= 2 * rcv_window_now)
> >                          time_to_ack = true;
> >             }
> >      }
> >      [....]
> > }
> >
> > yields time_to_ack = false, i.e.  __tcp_select_window(sk) returns
> > a value new_window  < (2 *  tcp_receive_window(tp)).
> >
> > In my log I have for brevity used the following names:
> >
> > win_now: same as rcv_window_now
> >      (= tcp_receive_window(tp),
> >       = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
> >       = 265469200 + 262144 -  265600160,
> >       = 131184)
> >
> > new_win: same as new_window
> >       (= __tcp_select_window(sk),
> >        = 0 first time, later 262144 )
> >
> > rcv_wnd: same as tp->rcv_wnd,
> >        (=262144)
> >
> > We see that although the last test actually is pretty close
> > (262144 >= 262368 ? => false) it is not close enough.
> >
> >
> > We also notice that
> > (tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
> > 130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
> > (new_window >= 2 * rcv_window_now) will always be false.
> >
> >
> > Too me it looks like __tcp_select_window(sk) doesn't at all take the
> > freed-up memory into account when calculating a new window. I haven't
> > looked into why that is happening.
> >
> >>
> >>> This means that this side's notion of the current window size is
> >>> different from the one last advertised to the peer, causing the latter
> >>> to not send any data to resolve the sitution.
> >>
> >> Since the peer last saw a zero receive window at the time of the
> >> memory-pressure drop, shouldn't the peer be sending repeated zero
> >> window probes, and shouldn't the local host respond to a ZWP with an
> >> ACK with the correct non-zero window?
> >
> > It should, but at the moment when I found this bug the peer stack was
> > not the Linux kernel stack, but one we develop for our own purpose. We
> > fixed that later, but it still means that traffic stops for a couple of
> > seconds now and then before the timer restarts the flow. This happens
> > too often for comfort in our usage scenarios.
> > We can of course blame the the peer stack, but I still feel this is a
> > bug, and that it could be handled better by the kernel stack.
> >>
> >> Do you happen to have a tcpdump .pcap of one of these cases that you
> >> can share?
> >
> > I had one, although not for this particular run, and I cannot find it
> > right now. I will continue looking or make a new one. Is there some
> > shared space I can put it?
>
> Here it is. Look at frame #1067.
>
> https://passt.top/static/iperf3_jon_zero_window_cut.pcap
> >
> >>
> >>> The problem occurs on the iperf3 server side, and the socket in question
> >>> is a completely regular socket with the default settings for the
> >>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
> >>>
> >>> The following excerpt of a logging session, with own comments added,
> >>> shows more in detail what is happening:
> >>>
> >>> //              tcp_v4_rcv(->)
> >>> //                tcp_rcv_established(->)
> >>> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/
> >>> tcp_data_queue()/5257 ====
> >>> [5201<->39222]:     tcp_data_queue(->)
> >>> [5201<->39222]:        DROPPING skb [265600160..265665640], reason:
> >>> SKB_DROP_REASON_PROTO_MEM
> >>>                         [rcv_nxt 265600160, rcv_wnd 262144, snt_ack
> >>> 265469200, win_now 131184]
> >>
> >> What is "win_now"? That doesn't seem to correspond to any variable
> >> name in the Linux source tree.
> >
> > See above.
> >
> >   Can this be renamed to the
> >> tcp_select_window() variable it is printing, like "cur_win" or
> >> "effective_win" or "new_win", etc?
> >>
> >> Or perhaps you can attach your debugging patch in some email thread? I
> >> agree with Eric that these debug dumps are a little hard to parse
> >> without seeing the patch that allows us to understand what some of
> >> these fields are...
> >>
> >> I agree with Eric that probably tp->pred_flags should be cleared, and
> >> a packetdrill test for this would be super-helpful.
> >
> > I must admit I have never used packetdrill, but I can make an effort.
>
> I hear from other sources that you cannot force a memory exhaustion with
> packetdrill anyway, so this sounds like a pointless exercise.

We certainly can and should add a feature like that to packetdrill.

Documentation/fault-injection/ has some relevant information.

Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf
that could be lowered by a packetdrill script I think.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-20 16:22       ` Eric Dumazet
@ 2025-01-24 17:40         ` Jon Maloy
  2025-01-27  9:53           ` Eric Dumazet
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Jon Maloy @ 2025-01-24 17:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, netdev, davem, kuba, passt-dev, sbrivio, lvivier,
	dgibson, eric.dumazet, Menglong Dong



On 2025-01-20 11:22, Eric Dumazet wrote:
> On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy@redhat.com> wrote:
>>
>>
>>
>> On 2025-01-20 00:03, Jon Maloy wrote:
>>>
>>>

[...]

>>>> I agree with Eric that probably tp->pred_flags should be cleared, and
>>>> a packetdrill test for this would be super-helpful.
>>>
>>> I must admit I have never used packetdrill, but I can make an effort.
>>
>> I hear from other sources that you cannot force a memory exhaustion with
>> packetdrill anyway, so this sounds like a pointless exercise.
> 
> We certainly can and should add a feature like that to packetdrill.
> 
> Documentation/fault-injection/ has some relevant information.
> 
> Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf
> that could be lowered by a packetdrill script I think.
> 
Neal, Eric,
How do you suggest we proceed with this?
I downloaded packetdrill and tried it a bit, but to understand it well 
enough to introduce a new feature would require more time than I am
able to spend on this. Maybe Neal, who I see is one of the contributors 
to packetdrill could help out?

I can certainly clear tp->pred_flags and post it again, maybe with
an improved and shortened log. Would that be acceptable?

I also made a run where I looked into why __tcp_select_window()
ignores all the space that has been freed up:


  tcp_recvmsg_locked(->)
    __tcp_cleanup_rbuf(->) (copied 131072)
      tp->rcv_wup: 1788299855, tp->rcv_wnd: 5812224,
      tp->rcv_nxt 1793800175
      __tcp_select_window(->)
        tcp_space(->)
        tcp_space(<-) returning 458163
        free_space = round_down(458163, 1 << 4096) = 454656
        (free_space > tp->rcv_ssthresh) -->
          free_space = tp->rcv_ssthresh = 261920
        window = ALIGN(261920, 4096) = 26144
      __tcp_select_window(<-) returning 262144
      [rcv_win_now 311904, 2 * rcv_win_now 623808, new_window 262144]
      (new_window >= (2 * rcv_win_now)) ? --> time_to_ack 0
      NOT calling tcp_send_ack()
    __tcp_cleanup_rbuf(<-)
    [tp->rcv_wup 1788299855, tp->rcv_wnd 5812224,
     tp->rcv_nxt 1793800175]
  tcp_recvmsg_locked(<-) returning 131072 bytes.
  [tp->rcv_nxt 1793800175, tp->rcv_wnd 5812224,
   tp->rcv_wup 1788299855, sk->last_ack 0, tcp_receive_win() 311904,
   copied_seq 1788299855->1788395953 (96098), unread 5404222,
   sk_rcv_qlen 83, ofo_qlen 0]


As we see tp->rcv_ssthresh is the limiting factor, causing
a consistent situation where (new_window < (rcv_win_now * 2)),
and even (new_window < rcv_win_now).

To me, it looks like tp->ssthresh should have a higher value
in this situation, or maybe we should alter this test.

The combination of these two issues, -not updating tp->wnd and
_tcp_select_window() returning a wrong value, is what is causing
this whole problem.

///jon






^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-24 17:40         ` Jon Maloy
@ 2025-01-27  9:53           ` Eric Dumazet
  2025-01-27 10:01           ` Stefano Brivio
  2025-01-27 16:37           ` Eric Dumazet
  2 siblings, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2025-01-27  9:53 UTC (permalink / raw)
  To: Jon Maloy
  Cc: Neal Cardwell, netdev, davem, kuba, passt-dev, sbrivio, lvivier,
	dgibson, eric.dumazet, Menglong Dong

On Fri, Jan 24, 2025 at 6:40 PM Jon Maloy <jmaloy@redhat.com> wrote:
>
>
>
> On 2025-01-20 11:22, Eric Dumazet wrote:
> > On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy@redhat.com> wrote:
> >>
> >>
> >>
> >> On 2025-01-20 00:03, Jon Maloy wrote:
> >>>
> >>>
>
> [...]
>
> >>>> I agree with Eric that probably tp->pred_flags should be cleared, and
> >>>> a packetdrill test for this would be super-helpful.
> >>>
> >>> I must admit I have never used packetdrill, but I can make an effort.
> >>
> >> I hear from other sources that you cannot force a memory exhaustion with
> >> packetdrill anyway, so this sounds like a pointless exercise.
> >
> > We certainly can and should add a feature like that to packetdrill.
> >
> > Documentation/fault-injection/ has some relevant information.
> >
> > Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf
> > that could be lowered by a packetdrill script I think.
> >
> Neal, Eric,
> How do you suggest we proceed with this?
> I downloaded packetdrill and tried it a bit, but to understand it well
> enough to introduce a new feature would require more time than I am
> able to spend on this. Maybe Neal, who I see is one of the contributors
> to packetdrill could help out?

 I will spend some time this week preparing for some tests.

I would prefer not merging new code without a clear understanding of the issue.

Thanks.

>
> I can certainly clear tp->pred_flags and post it again, maybe with
> an improved and shortened log. Would that be acceptable?
>
> I also made a run where I looked into why __tcp_select_window()
> ignores all the space that has been freed up:
>
>
>   tcp_recvmsg_locked(->)
>     __tcp_cleanup_rbuf(->) (copied 131072)
>       tp->rcv_wup: 1788299855, tp->rcv_wnd: 5812224,
>       tp->rcv_nxt 1793800175
>       __tcp_select_window(->)
>         tcp_space(->)
>         tcp_space(<-) returning 458163
>         free_space = round_down(458163, 1 << 4096) = 454656
>         (free_space > tp->rcv_ssthresh) -->
>           free_space = tp->rcv_ssthresh = 261920
>         window = ALIGN(261920, 4096) = 26144
>       __tcp_select_window(<-) returning 262144
>       [rcv_win_now 311904, 2 * rcv_win_now 623808, new_window 262144]
>       (new_window >= (2 * rcv_win_now)) ? --> time_to_ack 0
>       NOT calling tcp_send_ack()
>     __tcp_cleanup_rbuf(<-)
>     [tp->rcv_wup 1788299855, tp->rcv_wnd 5812224,
>      tp->rcv_nxt 1793800175]
>   tcp_recvmsg_locked(<-) returning 131072 bytes.
>   [tp->rcv_nxt 1793800175, tp->rcv_wnd 5812224,
>    tp->rcv_wup 1788299855, sk->last_ack 0, tcp_receive_win() 311904,
>    copied_seq 1788299855->1788395953 (96098), unread 5404222,
>    sk_rcv_qlen 83, ofo_qlen 0]
>
>
> As we see tp->rcv_ssthresh is the limiting factor, causing
> a consistent situation where (new_window < (rcv_win_now * 2)),
> and even (new_window < rcv_win_now).
>
> To me, it looks like tp->ssthresh should have a higher value
> in this situation, or maybe we should alter this test.
>
> The combination of these two issues, -not updating tp->wnd and
> _tcp_select_window() returning a wrong value, is what is causing
> this whole problem.
>
> ///jon
>
>
>
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-24 17:40         ` Jon Maloy
  2025-01-27  9:53           ` Eric Dumazet
@ 2025-01-27 10:01           ` Stefano Brivio
  2025-01-27 10:06             ` Eric Dumazet
  2025-01-27 10:17             ` Jason Xing
  2025-01-27 16:37           ` Eric Dumazet
  2 siblings, 2 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-27 10:01 UTC (permalink / raw)
  To: Jon Maloy, Eric Dumazet
  Cc: Neal Cardwell, netdev, davem, kuba, passt-dev, lvivier, dgibson,
	eric.dumazet, Menglong Dong

On Fri, 24 Jan 2025 12:40:16 -0500
Jon Maloy <jmaloy@redhat.com> wrote:

> I can certainly clear tp->pred_flags and post it again, maybe with
> an improved and shortened log. Would that be acceptable?  

Talking about an improved log, what strikes me the most of the whole
problem is:

$ tshark -r iperf3_jon_zero_window.pcap -td -Y 'frame.number in { 1064 .. 1068 }'
 1064   0.004416 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611679466 Ack=1 Win=36864 Len=65480
 1065   0.007334 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611744946 Ack=1 Win=36864 Len=65480
 1066   0.005104 192.168.122.1 → 192.168.122.198 TCP 56382 [TCP Window Full] 34482 → 5201 [ACK] Seq=1611810426 Ack=1 Win=36864 Len=56328
 1067   0.015226 192.168.122.198 → 192.168.122.1 TCP 54 [TCP ZeroWindow] 5201 → 34482 [ACK] Seq=1 Ack=1611090146 Win=0 Len=0
 1068   6.298138 fe80::44b3:f5ff:fe86:c529 → ff02::2      ICMPv6 70 Router Solicitation from 46:b3:f5:86:c5:29

...and then the silence, 192.168.122.198 never announces that its
window is not zero, so the peer gives up 15 seconds later:

$ tshark -r iperf3_jon_zero_window_cut.pcap -td -Y 'frame.number in { 1069 .. 1070 }'
 1069   8.709313 192.168.122.1 → 192.168.122.198 TCP 55 34466 → 5201 [ACK] Seq=166 Ack=5 Win=36864 Len=1
 1070   0.008943 192.168.122.198 → 192.168.122.1 TCP 54 5201 → 34482 [FIN, ACK] Seq=1 Ack=1611090146 Win=778240 Len=0

Data in frame #1069 is iperf3 ending the test.

This didn't happen before e2142825c120 ("net: tcp: send zero-window
ACK when no memory") so it's a relatively recent (17 months) regression.

It actually looks pretty simple (and rather serious) to me.

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-27 10:01           ` Stefano Brivio
@ 2025-01-27 10:06             ` Eric Dumazet
  2025-01-27 10:27               ` Stefano Brivio
  2025-01-27 10:17             ` Jason Xing
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2025-01-27 10:06 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Jon Maloy, Neal Cardwell, netdev, davem, kuba, passt-dev,
	lvivier, dgibson, eric.dumazet, Menglong Dong

On Mon, Jan 27, 2025 at 11:01 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Fri, 24 Jan 2025 12:40:16 -0500
> Jon Maloy <jmaloy@redhat.com> wrote:
>
> > I can certainly clear tp->pred_flags and post it again, maybe with
> > an improved and shortened log. Would that be acceptable?
>
> Talking about an improved log, what strikes me the most of the whole
> problem is:
>
> $ tshark -r iperf3_jon_zero_window.pcap -td -Y 'frame.number in { 1064 .. 1068 }'
>  1064   0.004416 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611679466 Ack=1 Win=36864 Len=65480
>  1065   0.007334 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611744946 Ack=1 Win=36864 Len=65480
>  1066   0.005104 192.168.122.1 → 192.168.122.198 TCP 56382 [TCP Window Full] 34482 → 5201 [ACK] Seq=1611810426 Ack=1 Win=36864 Len=56328
>  1067   0.015226 192.168.122.198 → 192.168.122.1 TCP 54 [TCP ZeroWindow] 5201 → 34482 [ACK] Seq=1 Ack=1611090146 Win=0 Len=0
>  1068   6.298138 fe80::44b3:f5ff:fe86:c529 → ff02::2      ICMPv6 70 Router Solicitation from 46:b3:f5:86:c5:29
>
> ...and then the silence, 192.168.122.198 never announces that its
> window is not zero, so the peer gives up 15 seconds later:
>
> $ tshark -r iperf3_jon_zero_window_cut.pcap -td -Y 'frame.number in { 1069 .. 1070 }'
>  1069   8.709313 192.168.122.1 → 192.168.122.198 TCP 55 34466 → 5201 [ACK] Seq=166 Ack=5 Win=36864 Len=1
>  1070   0.008943 192.168.122.198 → 192.168.122.1 TCP 54 5201 → 34482 [FIN, ACK] Seq=1 Ack=1611090146 Win=778240 Len=0
>
> Data in frame #1069 is iperf3 ending the test.
>
> This didn't happen before e2142825c120 ("net: tcp: send zero-window
> ACK when no memory") so it's a relatively recent (17 months) regression.
>
> It actually looks pretty simple (and rather serious) to me.
>

With all that, it should be pretty easy to cook a packetdrill test, right ?

packetdrill tests are part of tools/testing/selftests/net/ already, we
are not asking for something unreasonable.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-27 10:06             ` Eric Dumazet
@ 2025-01-27 10:27               ` Stefano Brivio
  0 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-27 10:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jon Maloy, Neal Cardwell, netdev, davem, kuba, passt-dev,
	lvivier, dgibson, eric.dumazet, Menglong Dong

On Mon, 27 Jan 2025 11:06:07 +0100
Eric Dumazet <edumazet@google.com> wrote:

> On Mon, Jan 27, 2025 at 11:01 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Fri, 24 Jan 2025 12:40:16 -0500
> > Jon Maloy <jmaloy@redhat.com> wrote:
> >  
> > > I can certainly clear tp->pred_flags and post it again, maybe with
> > > an improved and shortened log. Would that be acceptable?  
> >
> > Talking about an improved log, what strikes me the most of the whole
> > problem is:
> >
> > $ tshark -r iperf3_jon_zero_window.pcap -td -Y 'frame.number in { 1064 .. 1068 }'
> >  1064   0.004416 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611679466 Ack=1 Win=36864 Len=65480
> >  1065   0.007334 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611744946 Ack=1 Win=36864 Len=65480
> >  1066   0.005104 192.168.122.1 → 192.168.122.198 TCP 56382 [TCP Window Full] 34482 → 5201 [ACK] Seq=1611810426 Ack=1 Win=36864 Len=56328
> >  1067   0.015226 192.168.122.198 → 192.168.122.1 TCP 54 [TCP ZeroWindow] 5201 → 34482 [ACK] Seq=1 Ack=1611090146 Win=0 Len=0
> >  1068   6.298138 fe80::44b3:f5ff:fe86:c529 → ff02::2      ICMPv6 70 Router Solicitation from 46:b3:f5:86:c5:29
> >
> > ...and then the silence, 192.168.122.198 never announces that its
> > window is not zero, so the peer gives up 15 seconds later:
> >
> > $ tshark -r iperf3_jon_zero_window_cut.pcap -td -Y 'frame.number in { 1069 .. 1070 }'
> >  1069   8.709313 192.168.122.1 → 192.168.122.198 TCP 55 34466 → 5201 [ACK] Seq=166 Ack=5 Win=36864 Len=1
> >  1070   0.008943 192.168.122.198 → 192.168.122.1 TCP 54 5201 → 34482 [FIN, ACK] Seq=1 Ack=1611090146 Win=778240 Len=0
> >
> > Data in frame #1069 is iperf3 ending the test.
> >
> > This didn't happen before e2142825c120 ("net: tcp: send zero-window
> > ACK when no memory") so it's a relatively recent (17 months) regression.
> >
> > It actually looks pretty simple (and rather serious) to me.
> 
> With all that, it should be pretty easy to cook a packetdrill test, right ?

Not really :( because to reproduce this exact condition you need to
somehow get the right amount of memory pressure so that you can
actually establish a connection, start the transfer, and then exhaust
the receive buffer at the right moment.

And packetdrill doesn't do that. Sure, it would be great if it did, and
it's probably a nice feature to implement... given enough time. Given
less time, I guess fixing regressions has a higher priority.

One could perhaps tweak sk->sk_rcvbuf as you suggested but that just
artificially reproduces one part of it. It's not a really fitting test.
For example: when would you increase it back?

> packetdrill tests are part of tools/testing/selftests/net/ already, we
> are not asking for something unreasonable.

I would agree, in general, except that I don't see a way to craft a
test like this with packetdrill. At least not trivially with the
current feature set.

On top of that, this is not a new feature, it's a fix for a regression
(that was introduced without adding any test, of course). And the fix
itself was definitely tested, just not with packetdrill.

Requesting that tests are 1. automated and 2. written with a specific
tool is something I can quite understand for general convenience, but
I don't think it always makes sense.

Especially as this fix has been blocked for about 9 months now because
of the fact that automating a test for it is quite hard.

-- 
Stefano

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-27 10:01           ` Stefano Brivio
  2025-01-27 10:06             ` Eric Dumazet
@ 2025-01-27 10:17             ` Jason Xing
  2025-01-27 10:32               ` Stefano Brivio
  1 sibling, 1 reply; 20+ messages in thread
From: Jason Xing @ 2025-01-27 10:17 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Jon Maloy, Eric Dumazet, Neal Cardwell, netdev, davem, kuba,
	passt-dev, lvivier, dgibson, eric.dumazet, Menglong Dong

On Mon, Jan 27, 2025 at 6:01 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Fri, 24 Jan 2025 12:40:16 -0500
> Jon Maloy <jmaloy@redhat.com> wrote:
>
> > I can certainly clear tp->pred_flags and post it again, maybe with
> > an improved and shortened log. Would that be acceptable?
>
> Talking about an improved log, what strikes me the most of the whole
> problem is:
>
> $ tshark -r iperf3_jon_zero_window.pcap -td -Y 'frame.number in { 1064 .. 1068 }'
>  1064   0.004416 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611679466 Ack=1 Win=36864 Len=65480
>  1065   0.007334 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611744946 Ack=1 Win=36864 Len=65480
>  1066   0.005104 192.168.122.1 → 192.168.122.198 TCP 56382 [TCP Window Full] 34482 → 5201 [ACK] Seq=1611810426 Ack=1 Win=36864 Len=56328
>  1067   0.015226 192.168.122.198 → 192.168.122.1 TCP 54 [TCP ZeroWindow] 5201 → 34482 [ACK] Seq=1 Ack=1611090146 Win=0 Len=0
>  1068   6.298138 fe80::44b3:f5ff:fe86:c529 → ff02::2      ICMPv6 70 Router Solicitation from 46:b3:f5:86:c5:29
>
> ...and then the silence, 192.168.122.198 never announces that its
> window is not zero, so the peer gives up 15 seconds later:
>
> $ tshark -r iperf3_jon_zero_window_cut.pcap -td -Y 'frame.number in { 1069 .. 1070 }'
>  1069   8.709313 192.168.122.1 → 192.168.122.198 TCP 55 34466 → 5201 [ACK] Seq=166 Ack=5 Win=36864 Len=1
>  1070   0.008943 192.168.122.198 → 192.168.122.1 TCP 54 5201 → 34482 [FIN, ACK] Seq=1 Ack=1611090146 Win=778240 Len=0
>
> Data in frame #1069 is iperf3 ending the test.
>
> This didn't happen before e2142825c120 ("net: tcp: send zero-window
> ACK when no memory") so it's a relatively recent (17 months) regression.
>
> It actually looks pretty simple (and rather serious) to me.

I remembered last time it really also took me some time to totally
follow. Packetdrill should be helpful :)

As to the patch itself, I agreed with this fix last time while now I
have to re-read that long analysis to recall as much as possible. I'm
not that sure if it's a bug belonging to the Linux kernel. The other
side not sending a window probe causes this issue...? The other part
of me says we cannot break the user's behaviour.

One way or another, I will also take a look at it again.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-27 10:17             ` Jason Xing
@ 2025-01-27 10:32               ` Stefano Brivio
  2025-01-27 13:37                 ` Menglong Dong
  0 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-27 10:32 UTC (permalink / raw)
  To: Jason Xing
  Cc: Jon Maloy, Eric Dumazet, Neal Cardwell, netdev, davem, kuba,
	passt-dev, lvivier, dgibson, eric.dumazet, Menglong Dong

On Mon, 27 Jan 2025 18:17:28 +0800
Jason Xing <kerneljasonxing@gmail.com> wrote:

> I'm not that sure if it's a bug belonging to the Linux kernel.

It is, because for at least 20-25 years (before that it's a bit hard to
understand from history) a non-zero window would be announced, as
obviously expected, once there's again space in the receive window.

> The other side not sending a window probe causes this issue...?

It doesn't cause this issue, but it triggers it.

> The other part of me says we cannot break the user's behaviour.

This sounds quite relevant, yes.

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-27 10:32               ` Stefano Brivio
@ 2025-01-27 13:37                 ` Menglong Dong
  2025-01-27 14:03                   ` Stefano Brivio
  0 siblings, 1 reply; 20+ messages in thread
From: Menglong Dong @ 2025-01-27 13:37 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Jason Xing, Jon Maloy, Eric Dumazet, Neal Cardwell, netdev,
	davem, kuba, passt-dev, lvivier, dgibson, eric.dumazet

On Mon, Jan 27, 2025 at 6:32 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Mon, 27 Jan 2025 18:17:28 +0800
> Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> > I'm not that sure if it's a bug belonging to the Linux kernel.
>
> It is, because for at least 20-25 years (before that it's a bit hard to
> understand from history) a non-zero window would be announced, as
> obviously expected, once there's again space in the receive window.

Sorry for the late reply. I think the key of this problem is
what should we do when we receive a tcp packet and we are
out of memory.

The RFC doesn't define such a thing, so in the commit
e2142825c120 ("net: tcp: send zero-window ACK when no memory"),
I reply with a zero-window ACK to the peer. And the peer will keep
probing the window by retransmitting the packet that we dropped if
the peer is a LINUX SYSTEM.

As I said, the RFC doesn't define such a case, so the behavior of
the peer is undefined if it is not a LINUX SYSTEM. If the peer doesn't
keep retransmitting the packet, it will hang the connection, just like
the problem that described in this commit log.

However, we can make some optimization to make it more
adaptable. We can send a ACK with the right window to the
peer when the memory is available, and __tcp_cleanup_rbuf()
is a good choice.

Generally speaking, I think this patch makes sense. However,
I'm not sure if there is any other influence if we make
"tp->rcv_wnd=0", but it can trigger a ACK in __tcp_cleanup_rbuf().

Following is the code that I thought before to optimize this
case (the code is totally not tested):

diff --git a/include/net/inet_connection_sock.h
b/include/net/inet_connection_sock.h
index 3c82fad904d4..bedd78946762 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -116,7 +116,8 @@ struct inet_connection_sock {
         #define ATO_BITS 8
         __u32          ato:ATO_BITS,     /* Predicted tick of soft
clock       */
                   lrcv_flowlabel:20, /* last received ipv6 flowlabel       */
-                  unused:4;
+                  is_oom:1,
+                  unused:3;
         unsigned long      timeout;     /* Currently scheduled
timeout           */
         __u32          lrcvtime;     /* timestamp of last received
data packet */
         __u16          last_seg_size; /* Size of last incoming segment       */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..6f3c85a1f4da 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1458,11 +1458,11 @@ static int tcp_peek_sndq(struct sock *sk,
struct msghdr *msg, int len)
  */
 void __tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
+    struct inet_connection_sock *icsk = inet_csk(sk);
     struct tcp_sock *tp = tcp_sk(sk);
     bool time_to_ack = false;

     if (inet_csk_ack_scheduled(sk)) {
-        const struct inet_connection_sock *icsk = inet_csk(sk);

         if (/* Once-per-two-segments ACK was not sent by tcp_input.c */
             tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
@@ -1502,6 +1502,11 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
                 time_to_ack = true;
         }
     }
+    if (unlikely(icsk->icsk_ack.is_oom)) {
+        icsk->icsk_ack.is_oom = false;
+        time_to_ack = true;
+    }
+
     if (time_to_ack)
         tcp_send_ack(sk);
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..e2d65213b3b7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -268,9 +268,12 @@ static u16 tcp_select_window(struct sock *sk)
      * are out of memory. The window is temporary, so we don't store
      * it on the socket.
      */
-    if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+    if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+        inet_csk(sk)->icsk_ack.is_oom = true;
         return 0;
+    }

+    inet_csk(sk)->icsk_ack.is_oom = false;
     cur_win = tcp_receive_window(tp);
     new_win = __tcp_select_window(sk);
     if (new_win < cur_win) {

>
> > The other side not sending a window probe causes this issue...?
>
> It doesn't cause this issue, but it triggers it.
>
> > The other part of me says we cannot break the user's behaviour.
>
> This sounds quite relevant, yes.
>
> --
> Stefano
>

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-27 13:37                 ` Menglong Dong
@ 2025-01-27 14:03                   ` Stefano Brivio
  0 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-27 14:03 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Jason Xing, Jon Maloy, Eric Dumazet, Neal Cardwell, netdev,
	davem, kuba, passt-dev, lvivier, dgibson, eric.dumazet

On Mon, 27 Jan 2025 21:37:23 +0800
Menglong Dong <menglong8.dong@gmail.com> wrote:

> On Mon, Jan 27, 2025 at 6:32 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Mon, 27 Jan 2025 18:17:28 +0800
> > Jason Xing <kerneljasonxing@gmail.com> wrote:
> >  
> > > I'm not that sure if it's a bug belonging to the Linux kernel.  
> >
> > It is, because for at least 20-25 years (before that it's a bit hard to
> > understand from history) a non-zero window would be announced, as
> > obviously expected, once there's again space in the receive window.  
> 
> Sorry for the late reply. I think the key of this problem is
> what should we do when we receive a tcp packet and we are
> out of memory.
> 
> The RFC doesn't define such a thing,

Why not? RFC 9293, 3.8.6:

  There is an assumption that this is related to the data buffer space
  currently available for this connection.

That is, out-of-memory -> zero window.

> so in the commit
> e2142825c120 ("net: tcp: send zero-window ACK when no memory"),
> I reply with a zero-window ACK to the peer.

Your patch is fundamentally correct, nobody is disputing that. The
problem is that it introduces a side effect because it gets the notion
of "current window" out of sync by sending a one-off packet with a
zero-window, without recording that.

> And the peer will keep
> probing the window by retransmitting the packet that we dropped if
> the peer is a LINUX SYSTEM.
> 
> As I said, the RFC doesn't define such a case, so the behavior of
> the peer is undefined if it is not a LINUX SYSTEM. If the peer doesn't
> keep retransmitting the packet, it will hang the connection, just like
> the problem that described in this commit log.

It's not undefined. RFC 9293 3.8.6.1 (just like RFC 1122 4.2.2.17,
RFC 793 3.7) requires zero-window probes.

But keeping the window closed indefinitely if there's no zero-window
probe is a regression anyway:

- a retransmission timeout must elapse (RFC 9293 3.8.1) before the
  zero-window probe is sent, so relying on zero-window probes means
  introducing an unnecessary delay

- if the peer (as it was the case here) fails to send a zero-window
  probe for whatever reason, things break. This is a userspace
  breakage, regardless of the fact that the peer should send a
  zero-window probe

> However, we can make some optimization to make it more
> adaptable. We can send a ACK with the right window to the
> peer when the memory is available, and __tcp_cleanup_rbuf()
> is a good choice.
> 
> Generally speaking, I think this patch makes sense. However,
> I'm not sure if there is any other influence if we make
> "tp->rcv_wnd=0", but it can trigger a ACK in __tcp_cleanup_rbuf().

I don't understand what's your concern with the patch that was proposed
(and tested quite thoroughly, by the way).

> Following is the code that I thought before to optimize this
> case (the code is totally not tested):
>
> [...]

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net,v2] tcp: correct handling of extreme memory squeeze
  2025-01-24 17:40         ` Jon Maloy
  2025-01-27  9:53           ` Eric Dumazet
  2025-01-27 10:01           ` Stefano Brivio
@ 2025-01-27 16:37           ` Eric Dumazet
  2 siblings, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2025-01-27 16:37 UTC (permalink / raw)
  To: Jon Maloy
  Cc: Neal Cardwell, netdev, davem, kuba, passt-dev, sbrivio, lvivier,
	dgibson, eric.dumazet, Menglong Dong

On Fri, Jan 24, 2025 at 6:40 PM Jon Maloy <jmaloy@redhat.com> wrote:
>
>
>
> On 2025-01-20 11:22, Eric Dumazet wrote:
> > On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy@redhat.com> wrote:
> >>
> >>
> >>
> >> On 2025-01-20 00:03, Jon Maloy wrote:
> >>>
> >>>
>
> [...]
>
> >>>> I agree with Eric that probably tp->pred_flags should be cleared, and
> >>>> a packetdrill test for this would be super-helpful.
> >>>
> >>> I must admit I have never used packetdrill, but I can make an effort.
> >>
> >> I hear from other sources that you cannot force a memory exhaustion with
> >> packetdrill anyway, so this sounds like a pointless exercise.
> >
> > We certainly can and should add a feature like that to packetdrill.
> >
> > Documentation/fault-injection/ has some relevant information.
> >
> > Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf
> > that could be lowered by a packetdrill script I think.
> >
> Neal, Eric,
> How do you suggest we proceed with this?
> I downloaded packetdrill and tried it a bit, but to understand it well
> enough to introduce a new feature would require more time than I am
> able to spend on this. Maybe Neal, who I see is one of the contributors
> to packetdrill could help out?
>
> I can certainly clear tp->pred_flags and post it again, maybe with
> an improved and shortened log. Would that be acceptable?

Yes.

>
> I also made a run where I looked into why __tcp_select_window()
> ignores all the space that has been freed up:
>
>
>   tcp_recvmsg_locked(->)
>     __tcp_cleanup_rbuf(->) (copied 131072)
>       tp->rcv_wup: 1788299855, tp->rcv_wnd: 5812224,
>       tp->rcv_nxt 1793800175
>       __tcp_select_window(->)
>         tcp_space(->)
>         tcp_space(<-) returning 458163
>         free_space = round_down(458163, 1 << 4096) = 454656
>         (free_space > tp->rcv_ssthresh) -->
>           free_space = tp->rcv_ssthresh = 261920
>         window = ALIGN(261920, 4096) = 26144
>       __tcp_select_window(<-) returning 262144
>       [rcv_win_now 311904, 2 * rcv_win_now 623808, new_window 262144]
>       (new_window >= (2 * rcv_win_now)) ? --> time_to_ack 0
>       NOT calling tcp_send_ack()
>     __tcp_cleanup_rbuf(<-)
>     [tp->rcv_wup 1788299855, tp->rcv_wnd 5812224,
>      tp->rcv_nxt 1793800175]
>   tcp_recvmsg_locked(<-) returning 131072 bytes.
>   [tp->rcv_nxt 1793800175, tp->rcv_wnd 5812224,
>    tp->rcv_wup 1788299855, sk->last_ack 0, tcp_receive_win() 311904,
>    copied_seq 1788299855->1788395953 (96098), unread 5404222,
>    sk_rcv_qlen 83, ofo_qlen 0]
>
>
> As we see tp->rcv_ssthresh is the limiting factor, causing
> a consistent situation where (new_window < (rcv_win_now * 2)),
> and even (new_window < rcv_win_now).

Your changelog could simply explain this, in one sentence. instead of
lengthy traces.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-01-27 16:37 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-16  2:29 [net, v2] tcp: correct handling of extreme memory squeeze Jon Maloy
2025-01-16 21:14 ` Stefano Brivio
2025-01-17 21:40 [net,v2] " jmaloy
2025-01-17 22:09 ` Eric Dumazet
2025-01-17 22:27   ` Stefano Brivio
2025-01-18 17:01 ` Jason Xing
2025-01-18 20:04 ` Neal Cardwell
2025-01-20  5:03   ` Jon Maloy
2025-01-20 16:10     ` Jon Maloy
2025-01-20 16:22       ` Eric Dumazet
2025-01-24 17:40         ` Jon Maloy
2025-01-27  9:53           ` Eric Dumazet
2025-01-27 10:01           ` Stefano Brivio
2025-01-27 10:06             ` Eric Dumazet
2025-01-27 10:27               ` Stefano Brivio
2025-01-27 10:17             ` Jason Xing
2025-01-27 10:32               ` Stefano Brivio
2025-01-27 13:37                 ` Menglong Dong
2025-01-27 14:03                   ` Stefano Brivio
2025-01-27 16:37           ` Eric Dumazet

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).