public inbox for passt-dev@passt.top
* [net,v3] tcp: correct handling of extreme memory squeeze
@ 2025-01-27 23:13 jmaloy
  2025-01-28  0:52 ` Jason Xing
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: jmaloy @ 2025-01-27 23:13 UTC (permalink / raw)
  To: netdev, davem
  Cc: kuba, passt-dev, jmaloy, sbrivio, lvivier, dgibson,
	memnglong8.dong, kerneljasonxing, ncardwell, eric.dumazet,
	edumazet

From: Jon Maloy <jmaloy@redhat.com>

Testing with iperf3 using the "pasta" protocol splicer has revealed
a bug in the way tcp handles window advertising in extreme memory
squeeze situations.

Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.

However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the situation.

The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
Fedora 40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.

The following excerpt of a logging session, with our own comments added,
shows in more detail what is happening:

//              tcp_v4_rcv(->)
//                tcp_rcv_established(->)
[5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]:     tcp_data_queue(->)
[5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
                       [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                       [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
                       [OFO queue: gap: 65480, len: 0]
[5201<->39222]:     tcp_data_queue(<-)
[5201<->39222]:     __tcp_transmit_skb(->)
                        [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:       tcp_select_window(->)
[5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
                        [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
                        returning 0
[5201<->39222]:       tcp_select_window(<-)
[5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]:     __tcp_transmit_skb(<-)
[5201<->39222]:   tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)

// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.
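The value win_now = 131184 seen throughout the log follows directly from these
three fields. A minimal standalone model, paraphrasing tcp_receive_window()
from include/net/tcp.h (field names kept, everything else simplified), shows
the calculation:

```c
#include <stdint.h>

/* Stand-in for the three struct tcp_sock fields used in the log. */
struct tp_state {
	uint32_t rcv_wup; /* rcv_nxt at the time of the last window update */
	uint32_t rcv_wnd; /* window we believe we last advertised */
	uint32_t rcv_nxt; /* next sequence number we expect to receive */
};

/* Paraphrase of tcp_receive_window(): how much of the advertised
 * window the peer may still fill, from this side's point of view. */
static uint32_t receive_window(const struct tp_state *tp)
{
	int32_t win = (int32_t)(tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt);

	return win < 0 ? 0 : (uint32_t)win;
}
```

Plugging in the logged values, 265469200 + 262144 - 265600160 = 131184, i.e.
win_now. Because the zero advertisement updates none of these fields, this
value stays frozen no matter how much memory is later freed.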

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
                  returning 6104 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.
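The check referred to here can be modelled in isolation. This is a simplified
sketch of the "read freed lots of space" condition in __tcp_cleanup_rbuf()
(net/ipv4/tcp.c), not the full function: an ACK is only sent when the newly
selected window is at least twice the window we believe is currently open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the ack decision in __tcp_cleanup_rbuf():
 * "lots of space" means the new window is at least twice win_now. */
static bool time_to_ack(uint32_t new_win, uint32_t win_now)
{
	return new_win != 0 && new_win >= 2 * win_now;
}
```

With win_now stuck at 131184, even a full new_win of 262144 falls 224 bytes
short of 2 * 131184 = 262368, so the window update is never sent.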

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
                  returning 131072 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// The above pattern repeats again and again, since nothing changes
// between the reads.

[...]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
                  returning 54672 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.
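The fix breaks this cycle by making the socket state match what was put on
the wire. A minimal sketch of the patched ICSK_ACK_NOMEM branch of
tcp_select_window() (struct and helper simplified, field names as in the
kernel):

```c
#include <stdint.h>

struct tp_state {
	uint32_t rcv_wup;
	uint32_t rcv_wnd;
	uint32_t rcv_nxt;
};

/* Sketch of the patched out-of-memory path in tcp_select_window():
 * record the zero window in the socket instead of only sending it. */
static uint16_t select_window_nomem(struct tp_state *tp)
{
	tp->rcv_wnd = 0;           /* we really advertised zero */
	tp->rcv_wup = tp->rcv_nxt; /* window edge = next expected seq */
	return 0;                  /* value placed in the TCP header */
}
```

Afterwards rcv_wup + rcv_wnd - rcv_nxt is 0, so the first non-zero window
selected after the squeeze is at least twice win_now and an opening ACK
goes out.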

Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq' which is not stored in tp->rcv_wup
as it should be. This backtracking of ack_seq seems to be harmless, but
is of course still wrong from a protocol viewpoint.

We fix this by updating the socket state correctly when a packet has
been dropped because of memory exhaustion and we have to advertise
a zero window.

Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Cc: Menglong Dong <menglong8.dong@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v1: -Posted on Apr 6, 2024:
     https://lore.kernel.org/all/20240406182107.261472-1-jmaloy@redhat.com/#r
v2: -Improved commit log to clarify how we end up in this situation.
    -After feedback from Eric Dumazet, removed references to use of
     SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
     about how this situation occurs. Those flags are used at the
     peer side's incoming connection, and not on this one.
v3: -Clearing tp->pred_flags when we do a zero advertisement.
    -Edited (manually) and shortened the log documenting the problem.
---
 net/ipv4/tcp_output.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..bc95d2a5924f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -265,11 +265,14 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->pred_flags = 0;
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [net,v3] tcp: correct handling of extreme memory squeeze
  2025-01-27 23:13 [net,v3] tcp: correct handling of extreme memory squeeze jmaloy
@ 2025-01-28  0:52 ` Jason Xing
  2025-01-28 15:04 ` Eric Dumazet
  2025-01-30  3:10 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 9+ messages in thread
From: Jason Xing @ 2025-01-28  0:52 UTC (permalink / raw)
  To: jmaloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	memnglong8.dong, ncardwell, eric.dumazet, edumazet

On Tue, Jan 28, 2025 at 7:13 AM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the situation.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
>                        [OFO queue: gap: 65480, len: 0]
> [5201<->39222]:     tcp_data_queue(<-)
> [5201<->39222]:     __tcp_transmit_skb(->)
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:       tcp_select_window(->)
> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>                         returning 0
> [5201<->39222]:       tcp_select_window(<-)
> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]:     [__tcp_transmit_skb(<-)
> [5201<->39222]:   tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>                   returning 6104 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>                   returning 131072 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>                   returning 54672 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
>
> Furthermore, we have observed that in these situations this side may
> send out an updated 'th->ack_seq' which is not stored in tp->rcv_wup
> as it should be. This backtracking of ack_seq seems to be harmless, but
> is of course still wrong from a protocol viewpoint.
>
> We fix this by updating the socket state correctly when a packet has
> been dropped because of memory exhaustion and we have to advertise
> a zero window.
>
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
>
> Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
> Cc: Menglong Dong <menglong8.dong@gmail.com>
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>

I reckon that adding more description of why this case can be
triggered is really necessary: the reason is that the other side,
running a different stack, doesn't periodically send a window probe.
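For reference, a peer that implements the classic persist timer would escape
this state on its own: on seeing a zero window it keeps sending zero-window
probes with exponential backoff, and any probe elicits an ACK carrying the
reopened window. The schedule below is a hypothetical standalone sketch of
such a backoff (loosely modelled on a probe0-style timer, not actual kernel
code):

```c
#include <stdint.h>

/* Hypothetical zero-window-probe schedule: double the interval after
 * each unanswered probe, clamped to a maximum. Times in milliseconds. */
static uint32_t next_probe_timeout(uint32_t rto_ms, unsigned int probes_out,
				   uint32_t max_ms)
{
	uint64_t t = (uint64_t)rto_ms << probes_out; /* rto * 2^probes_out */

	return t > max_ms ? max_ms : (uint32_t)t;
}
```

A stack without such a probe timer never learns that the window reopened,
which is why the receiver-side fix matters.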

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks,
Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [net,v3] tcp: correct handling of extreme memory squeeze
  2025-01-27 23:13 [net,v3] tcp: correct handling of extreme memory squeeze jmaloy
  2025-01-28  0:52 ` Jason Xing
@ 2025-01-28 15:04 ` Eric Dumazet
  2025-01-28 15:18   ` Stefano Brivio
                     ` (2 more replies)
  2025-01-30  3:10 ` patchwork-bot+netdevbpf
  2 siblings, 3 replies; 9+ messages in thread
From: Eric Dumazet @ 2025-01-28 15:04 UTC (permalink / raw)
  To: jmaloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	memnglong8.dong, kerneljasonxing, ncardwell, eric.dumazet

On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the situation.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
>                        [OFO queue: gap: 65480, len: 0]
> [5201<->39222]:     tcp_data_queue(<-)
> [5201<->39222]:     __tcp_transmit_skb(->)
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:       tcp_select_window(->)
> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>                         returning 0
> [5201<->39222]:       tcp_select_window(<-)
> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]:     [__tcp_transmit_skb(<-)
> [5201<->39222]:   tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>                   returning 6104 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>                   returning 131072 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>                   returning 54672 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.

This so-called 'deadlock' only occurs if a remote TCP stack is unable
to send win0 probes.

In this case, sending some ACK will not help reliably if these ACKs get lost.

I find the description tries very hard to hide a bug in another stack,
for some reason.

When under memory stress, not sending an opening ACK as fast as possible,
thus giving the host time to recover from the memory stress, was also a
sensible thing to do.

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks for the fix.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [net,v3] tcp: correct handling of extreme memory squeeze
  2025-01-28 15:04 ` Eric Dumazet
@ 2025-01-28 15:18   ` Stefano Brivio
  2025-01-28 15:56   ` Neal Cardwell
  2025-01-28 16:51   ` Jon Maloy
  2 siblings, 0 replies; 9+ messages in thread
From: Stefano Brivio @ 2025-01-28 15:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: jmaloy, netdev, davem, kuba, passt-dev, lvivier, dgibson,
	memnglong8.dong, kerneljasonxing, ncardwell, eric.dumazet

On Tue, 28 Jan 2025 16:04:35 +0100
Eric Dumazet <edumazet@google.com> wrote:

> This so-called 'deadlock' only occurs if a remote TCP stack is unable
> to send win0 probes.
> 
> In this case, sending some ACK will not help reliably if these ACK get lost.
> 
> I find the description tries very hard to hide a bug in another stack,
> for some reason.

Side note: that was fixed meanwhile. Back then, at a first analysis, we
thought it was a workaround, but it's the actual fix as we *must* send
zero-window probes in that situation, and this commit triggers them:

  https://passt.top/passt/commit/?id=a740e16fd1b9bdca8d259aa6d37f942a3874425c

-- 
Stefano


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [net,v3] tcp: correct handling of extreme memory squeeze
  2025-01-28 15:04 ` Eric Dumazet
  2025-01-28 15:18   ` Stefano Brivio
@ 2025-01-28 15:56   ` Neal Cardwell
  2025-01-28 16:01     ` Eric Dumazet
  2025-01-28 16:51   ` Jon Maloy
  2 siblings, 1 reply; 9+ messages in thread
From: Neal Cardwell @ 2025-01-28 15:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: jmaloy, netdev, davem, kuba, passt-dev, sbrivio, lvivier,
	dgibson, memnglong8.dong, kerneljasonxing, eric.dumazet

On Tue, Jan 28, 2025 at 10:04 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote:
> >
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > Testing with iperf3 using the "pasta" protocol splicer has revealed
> > a bug in the way tcp handles window advertising in extreme memory
> > squeeze situations.
> >
> > Under memory pressure, a socket endpoint may temporarily advertise
> > a zero-sized window, but this is not stored as part of the socket data.
> > The reasoning behind this is that it is considered a temporary setting
> > which shouldn't influence any further calculations.
> >
> > However, if we happen to stall at an unfortunate value of the current
> > window size, the algorithm selecting a new value will consistently fail
> > to advertise a non-zero window once we have freed up enough memory.
> > This means that this side's notion of the current window size is
> > different from the one last advertised to the peer, causing the latter
> > to not send any data to resolve the situation.
> >
> > The problem occurs on the iperf3 server side, and the socket in question
> > is a completely regular socket with the default settings for the
> > fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
> >
> > The following excerpt of a logging session, with own comments added,
> > shows more in detail what is happening:
> >
> > //              tcp_v4_rcv(->)
> > //                tcp_rcv_established(->)
> > [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> > [5201<->39222]:     tcp_data_queue(->)
> > [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
> >                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> >                        [OFO queue: gap: 65480, len: 0]
> > [5201<->39222]:     tcp_data_queue(<-)
> > [5201<->39222]:     __tcp_transmit_skb(->)
> >                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]:       tcp_select_window(->)
> > [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
> >                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> >                         returning 0
> > [5201<->39222]:       tcp_select_window(<-)
> > [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> > [5201<->39222]:     [__tcp_transmit_skb(<-)
> > [5201<->39222]:   tcp_rcv_established(<-)
> > [5201<->39222]: tcp_v4_rcv(<-)
> >
> > // Receive queue is at 85 buffers and we are out of memory.
> > // We drop the incoming buffer, although it is in sequence, and decide
> > // to send an advertisement with a window of zero.
> > // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> > // we unconditionally shrink the window.
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]:     NOT calling tcp_send_ack()
> >                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]:   __tcp_cleanup_rbuf(<-)
> >                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                   [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
> >                   returning 6104 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // After each read, the algorithm for calculating the new receive
> > // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> > // or to update tp->rcv_wnd.
> > // Meanwhile, the peer thinks the window is zero, and will not send
> > // any more data to trigger an update from the interrupt mode side.
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]:     NOT calling tcp_send_ack()
> >                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]:   __tcp_cleanup_rbuf(<-)
> >                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                   [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
> >                   returning 131072 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // The above pattern repeats again and again, since nothing changes
> > // between the reads.
> >
> > [...]
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]:     NOT calling tcp_send_ack()
> >                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]:   __tcp_cleanup_rbuf(<-)
> >                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                   [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
> >                   returning 54672 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // The receive queue is empty, but no new advertisement has been sent.
> > // The peer still thinks the receive window is zero, and sends nothing.
> > // We have ended up in a deadlock situation.
>
> This so-called 'deadlock' only occurs if a remote TCP stack is unable
> to send win0 probes.
>
> In this case, sending some ACK will not help reliably if these ACK get lost.
>
> I find the description tries very hard to hide a bug in another stack,
> for some reason.
>
> When under memory stress, not sending an opening ACK as fast as possible,
> giving time for the host to recover from this memory stress was also a
> sensible thing to do.
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
>
> Thanks for the fix.

Yes, thanks for the fix. LGTM as well.

Reviewed-by: Neal Cardwell <ncardwell@google.com>

BTW, IMHO it would be nice to have some sort of NET_INC_STATS() bump of
an SNMP stat for this case, since we have SNMP stat increases for other
0-window cases. That could help in debugging performance problems from
memory pressure and zero windows. But that can be in a separate patch
for net-next once this fix is in net-next.

neal

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [net,v3] tcp: correct handling of extreme memory squeeze
  2025-01-28 15:56   ` Neal Cardwell
@ 2025-01-28 16:01     ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2025-01-28 16:01 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: jmaloy, netdev, davem, kuba, passt-dev, sbrivio, lvivier,
	dgibson, memnglong8.dong, kerneljasonxing, eric.dumazet

On Tue, Jan 28, 2025 at 4:57 PM Neal Cardwell <ncardwell@google.com> wrote:

>
> Yes, thanks for the fix. LGTM as well.
>
> Reviewed-by: Neal Cardwell <ncardwell@google.com>
>
> BTW, IMHO it would be nice to have some sort of NET_INC_STATS() of an
> SNMP stat for this case, since we have SNMP stat increases for other
> 0-window cases. That could help debugging performance problems from
> memory pressure and zero windows. But that can be in a separate patch
> for net-next once this fix is in net-next.

This is LINUX_MIB_TCPRCVQDROP (TCPRcvQDrop)


* Re: [net,v3] tcp: correct handling of extreme memory squeeze
  2025-01-28 15:04 ` Eric Dumazet
  2025-01-28 15:18   ` Stefano Brivio
  2025-01-28 15:56   ` Neal Cardwell
@ 2025-01-28 16:51   ` Jon Maloy
  2025-01-28 17:11     ` Eric Dumazet
  2 siblings, 1 reply; 9+ messages in thread
From: Jon Maloy @ 2025-01-28 16:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	memnglong8.dong, kerneljasonxing, ncardwell, eric.dumazet



On 2025-01-28 10:04, Eric Dumazet wrote:
> On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote:
>>
>> From: Jon Maloy <jmaloy@redhat.com>
>>
>> Testing with iperf3 using the "pasta" protocol splicer has revealed
>> a bug in the way tcp handles window advertising in extreme memory
>> squeeze situations.
>>
>> Under memory pressure, a socket endpoint may temporarily advertise
>> a zero-sized window, but this is not stored as part of the socket data.
>> The reasoning behind this is that it is considered a temporary setting
>> which shouldn't influence any further calculations.
>>
>> However, if we happen to stall at an unfortunate value of the current
>> window size, the algorithm selecting a new value will consistently fail
>> to advertise a non-zero window once we have freed up enough memory.
>> This means that this side's notion of the current window size is
>> different from the one last advertised to the peer, causing the latter
>> to not send any data to resolve the situation.
>>
>> The problem occurs on the iperf3 server side, and the socket in question
>> is a completely regular socket with the default settings for the
>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>>
>> The following excerpt of a logging session, with own comments added,
>> shows more in detail what is happening:
>>
>> //              tcp_v4_rcv(->)
>> //                tcp_rcv_established(->)
>> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
>> [5201<->39222]:     tcp_data_queue(->)
>> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>>                         [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>>                         [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
>>                         [OFO queue: gap: 65480, len: 0]
>> [5201<->39222]:     tcp_data_queue(<-)
>> [5201<->39222]:     __tcp_transmit_skb(->)
>>                          [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>> [5201<->39222]:       tcp_select_window(->)
>> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
>>                          [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>>                          returning 0
>> [5201<->39222]:       tcp_select_window(<-)
>> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
>> [5201<->39222]:     [__tcp_transmit_skb(<-)
>> [5201<->39222]:   tcp_rcv_established(<-)
>> [5201<->39222]: tcp_v4_rcv(<-)
>>
>> // Receive queue is at 85 buffers and we are out of memory.
>> // We drop the incoming buffer, although it is in sequence, and decide
>> // to send an advertisement with a window of zero.
>> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
>> // we unconditionally shrink the window.
>>
>> [5201<->39222]: tcp_recvmsg_locked(->)
>> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
>> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
>> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
>> [5201<->39222]:     NOT calling tcp_send_ack()
>>                      [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>>                    [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>>                    [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>>                    returning 6104 bytes
>> [5201<->39222]: tcp_recvmsg_locked(<-)
>>
>> // After each read, the algorithm for calculating the new receive
>> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
>> // or to update tp->rcv_wnd.
>> // Meanwhile, the peer thinks the window is zero, and will not send
>> // any more data to trigger an update from the interrupt mode side.
>>
>> [5201<->39222]: tcp_recvmsg_locked(->)
>> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
>> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
>> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
>> [5201<->39222]:     NOT calling tcp_send_ack()
>>                      [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>>                    [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>>                    [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>>                    returning 131072 bytes
>> [5201<->39222]: tcp_recvmsg_locked(<-)
>>
>> // The above pattern repeats again and again, since nothing changes
>> // between the reads.
>>
>> [...]
>>
>> [5201<->39222]: tcp_recvmsg_locked(->)
>> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
>> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
>> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
>> [5201<->39222]:     NOT calling tcp_send_ack()
>>                      [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>>                    [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>>                    [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>>                    returning 54672 bytes
>> [5201<->39222]: tcp_recvmsg_locked(<-)
>>
>> // The receive queue is empty, but no new advertisement has been sent.
>> // The peer still thinks the receive window is zero, and sends nothing.
>> // We have ended up in a deadlock situation.
> 
> This so-called 'deadlock' only occurs if a remote TCP stack is unable
> to send win0 probes.
> 
> In this case, sending some ACK will not help reliably if these ACK get lost.
> 
> I find the description tries very hard to hide a bug in another stack,
> for some reason.

I clearly stated in a previous comment that this was the case, and that
it has been fixed now. My reason for posting this is that I still
think this is a bug, just as I think the way we use rcv_ssthresh in
__tcp_select_window() is a bug that eventually should be fixed.

> 
> When under memory stress, not sending an opening ACK as fast as possible,
> giving the host time to recover from the memory stress, was also a
> sensible thing to do.
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> 
> Thanks for the fix.
> 



* Re: [net,v3] tcp: correct handling of extreme memory squeeze
  2025-01-28 16:51   ` Jon Maloy
@ 2025-01-28 17:11     ` Eric Dumazet
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2025-01-28 17:11 UTC (permalink / raw)
  To: Jon Maloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	memnglong8.dong, kerneljasonxing, ncardwell, eric.dumazet

On Tue, Jan 28, 2025 at 5:51 PM Jon Maloy <jmaloy@redhat.com> wrote:

> I clearly stated in a previous comment that this was the case, and that
> it has been fixed now. My reason for posting this is because I still
> think this is a bug, just as I think the way we use rcv_ssthresh in
> _tcp_select)window() is a bug that eventually should be fixed.

I was referring to a wrong statement in the changelog, claiming a
'deadlock situation' ...

It is pretty clear there is no deadlock here, unless the remote TCP
stack is _absolutely_ _broken_.

If you still want to capture this in an official changelog, it would
be nice to clarify this, to avoid yet another CVE being filed based on
scary sentences misleading many teams in the world.

Keep changelogs accurate and factual, so that we can find useful
signals in them.

All your __tcp_cleanup_rbuf() repetitions are simply noise. It does not matter
if it is called once or ten times.


* Re: [net,v3] tcp: correct handling of extreme memory squeeze
  2025-01-27 23:13 [net,v3] tcp: correct handling of extreme memory squeeze jmaloy
  2025-01-28  0:52 ` Jason Xing
  2025-01-28 15:04 ` Eric Dumazet
@ 2025-01-30  3:10 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 9+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-01-30  3:10 UTC (permalink / raw)
  To: Jon Maloy
  Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
	memnglong8.dong, kerneljasonxing, ncardwell, eric.dumazet,
	edumazet

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 27 Jan 2025 18:13:04 -0500 you wrote:
> From: Jon Maloy <jmaloy@redhat.com>
> 
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
> 
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
> 
> [...]

Here is the summary with links:
  - [net,v3] tcp: correct handling of extreme memory squeeze
    https://git.kernel.org/netdev/net/c/8c670bdfa58e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




end of thread, other threads:[~2025-01-30  3:10 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-27 23:13 [net,v3] tcp: correct handling of extreme memory squeeze jmaloy
2025-01-28  0:52 ` Jason Xing
2025-01-28 15:04 ` Eric Dumazet
2025-01-28 15:18   ` Stefano Brivio
2025-01-28 15:56   ` Neal Cardwell
2025-01-28 16:01     ` Eric Dumazet
2025-01-28 16:51   ` Jon Maloy
2025-01-28 17:11     ` Eric Dumazet
2025-01-30  3:10 ` patchwork-bot+netdevbpf

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt
