From: Eric Dumazet <edumazet@google.com>
To: jmaloy@redhat.com
Cc: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
dgibson@redhat.com, memnglong8.dong@gmail.com,
kerneljasonxing@gmail.com, ncardwell@google.com,
eric.dumazet@gmail.com
Subject: Re: [net,v3] tcp: correct handling of extreme memory squeeze
Date: Tue, 28 Jan 2025 16:04:35 +0100
Message-ID: <CANn89i+x2RGHDA6W-oo=Hs8bM=4Ao_aAKFsRrFhq=U133j+FvA@mail.gmail.com>
In-Reply-To: <20250127231304.1465565-1-jmaloy@redhat.com>
On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way TCP handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
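>
> As an illustration, the branch in question sits in tcp_select_window()
> (net/ipv4/tcp_output.c); the following is a paraphrased sketch, not a
> verbatim copy of the kernel source:
>
>         /* Out of memory while queueing the data: advertise a zero
>          * window. The window is treated as temporary, so tp->rcv_wnd
>          * and tp->rcv_wup are deliberately left untouched here.
>          */
>         if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
>                 return 0;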
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the situation.
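>
> The test that keeps failing lives in __tcp_cleanup_rbuf() (net/ipv4/tcp.c).
> A paraphrased sketch, with tcp_receive_window() expanded inline and,
> again, not a verbatim copy of the kernel source:
>
>         /* Window the peer is assumed to see, from the stored values. */
>         u32 win_now = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt;
>         u32 new_win = __tcp_select_window(sk);
>
>         /* Send an ACK only if the reads freed "a lot" of space, i.e.
>          * the new window is at least twice what the peer currently
>          * sees. Since tp->rcv_wnd/tp->rcv_wup were never updated when
>          * the zero window went out, win_now is computed from stale
>          * values.
>          */
>         if (new_win && new_win >= 2 * win_now)
>                 time_to_ack = true;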
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with our own comments added,
> shows in more detail what is happening:
>
> // tcp_v4_rcv(->)
> // tcp_rcv_established(->)
> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]: tcp_data_queue(->)
> [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> [OFO queue: gap: 65480, len: 0]
> [5201<->39222]: tcp_data_queue(<-)
> [5201<->39222]: __tcp_transmit_skb(->)
> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]: tcp_select_window(->)
> [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> returning 0
> [5201<->39222]: tcp_select_window(<-)
> [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]: [__tcp_transmit_skb(<-)
> [5201<->39222]: tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
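>
> With the logged values this means: the right window edge the peer last
> saw is rcv_wup + rcv_wnd = 265469200 + 262144 = 265731344, while the
> zero-window ACK implies a right edge of 265600160 + 0 = 265600160.
> The advertisement therefore shrinks the window by 131184 bytes, yet
> neither tp->rcv_wnd nor tp->rcv_wup records that fact.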
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]: __tcp_cleanup_rbuf(<-)
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
> returning 6104 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
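>
> Concretely, from the log above: new_win = 262144 while 2 * win_now =
> 2 * 131184 = 262368, so the test misses by only 224 bytes; and because
> neither value changes between reads, it keeps missing on every pass.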
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]: __tcp_cleanup_rbuf(<-)
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
> returning 131072 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]: __tcp_cleanup_rbuf(<-)
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
> returning 54672 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
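>
> The logged numbers can be dropped into a small stand-alone program (an
> illustration only, not kernel code) to see why the ACK condition from
> __tcp_cleanup_rbuf() never becomes true, even once the queue is empty:
>
>         #include <stdio.h>
>
>         int main(void)
>         {
>                 unsigned int rcv_wup = 265469200; /* last window update sent to peer */
>                 unsigned int rcv_wnd = 262144;    /* last stored receive window */
>                 unsigned int rcv_nxt = 265600160; /* next sequence number we expect */
>                 unsigned int new_win = 262144;    /* what __tcp_select_window() offers now */
>
>                 unsigned int win_now = rcv_wup + rcv_wnd - rcv_nxt; /* 131184 */
>
>                 printf("win_now=%u new_win=%u 2*win_now=%u\n",
>                        win_now, new_win, 2 * win_now);
>                 printf("time_to_ack=%d\n", new_win >= 2 * win_now); /* prints 0 */
>                 return 0;
>         }
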
This so-called 'deadlock' only occurs if the remote TCP stack is unable
to send zero-window probes.
In that case, sending an ACK will not reliably help either, since these
ACKs can themselves get lost.
I find the description tries very hard to hide a bug in another stack,
for some reason.
When under memory stress, not sending an opening ACK as fast as possible,
thus giving the host time to recover from the memory stress, was also a
sensible thing to do.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Thanks for the fix.