From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20250127231304.1465565-1-jmaloy@redhat.com> <CANn89i+x2RGHDA6W-oo=Hs8bM=4Ao_aAKFsRrFhq=U133j+FvA@mail.gmail.com>
In-Reply-To: <CANn89i+x2RGHDA6W-oo=Hs8bM=4Ao_aAKFsRrFhq=U133j+FvA@mail.gmail.com>
From: Neal Cardwell <ncardwell@google.com>
Date: Tue, 28 Jan 2025 10:56:43 -0500
Message-ID: <CADVnQyn7afmGhuUOEzvFV099476pxrAUHE+FVnmiwwbo1tu1oA@mail.gmail.com>
Subject: Re: [net,v3] tcp: correct handling of extreme memory squeeze
To: Eric Dumazet <edumazet@google.com>
Content-Type: text/plain; charset="UTF-8"
CC: jmaloy@redhat.com, netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
 dgibson@redhat.com, memnglong8.dong@gmail.com, kerneljasonxing@gmail.com, eric.dumazet@gmail.com
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/CADVnQyn7afmGhuUOEzvFV099476pxrAUHE+FVnmiwwbo1tu1oA@mail.gmail.com/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/EL4ZFIUA22KSOA7NDOEVHP2NU6IAQZS7/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>

On Tue, Jan 28, 2025 at 10:04 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jan 28, 2025 at 12:13 AM <jmaloy@redhat.com> wrote:
> >
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > Testing with iperf3 using the "pasta" protocol splicer has revealed
> > a bug in the way tcp handles window advertising in extreme memory
> > squeeze situations.
> >
> > Under memory pressure, a socket endpoint may temporarily advertise
> > a zero-sized window, but this is not stored as part of the socket data.
> > The reasoning behind this is that it is considered a temporary setting
> > which shouldn't influence any further calculations.
> >
> > However, if we happen to stall at an unfortunate value of the current
> > window size, the algorithm selecting a new value will consistently fail
> > to advertise a non-zero window once we have freed up enough memory.
> > This means that this side's notion of the current window size is
> > different from the one last advertised to the peer, causing the latter
> > to not send any data to resolve the situation.
> >
> > The problem occurs on the iperf3 server side, and the socket in question
> > is a completely regular socket with the default settings for the
> > fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
> >
> > The following excerpt of a logging session, with own comments added,
> > shows more in detail what is happening:
> >
> > // tcp_v4_rcv(->)
> > // tcp_rcv_established(->)
> > [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> > [5201<->39222]: tcp_data_queue(->)
> > [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
> >                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                 [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> >                 [OFO queue: gap: 65480, len: 0]
> > [5201<->39222]: tcp_data_queue(<-)
> > [5201<->39222]: __tcp_transmit_skb(->)
> >                 [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]: tcp_select_window(->)
> > [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
> >                 [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> >                 returning 0
> > [5201<->39222]: tcp_select_window(<-)
> > [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
> > [5201<->39222]: [__tcp_transmit_skb(<-)
> > [5201<->39222]: tcp_rcv_established(<-)
> > [5201<->39222]: tcp_v4_rcv(<-)
> >
> > // Receive queue is at 85 buffers and we are out of memory.
> > // We drop the incoming buffer, although it is in sequence, and decide
> > // to send an advertisement with a window of zero.
> > // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> > // we unconditionally shrink the window.
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]: NOT calling tcp_send_ack()
> >                 [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]: __tcp_cleanup_rbuf(<-)
> >                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                 [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
> >                 returning 6104 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // After each read, the algorithm for calculating the new receive
> > // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> > // or to update tp->rcv_wnd.
> > // Meanwhile, the peer thinks the window is zero, and will not send
> > // any more data to trigger an update from the interrupt mode side.
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]: NOT calling tcp_send_ack()
> >                 [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]: __tcp_cleanup_rbuf(<-)
> >                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                 [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
> >                 returning 131072 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // The above pattern repeats again and again, since nothing changes
> > // between the reads.
> >
> > [...]
> >
> > [5201<->39222]: tcp_recvmsg_locked(->)
> > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> > [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> > [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> > [5201<->39222]: NOT calling tcp_send_ack()
> >                 [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> > [5201<->39222]: __tcp_cleanup_rbuf(<-)
> >                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> >                 [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
> >                 returning 54672 bytes
> > [5201<->39222]: tcp_recvmsg_locked(<-)
> >
> > // The receive queue is empty, but no new advertisement has been sent.
> > // The peer still thinks the receive window is zero, and sends nothing.
> > // We have ended up in a deadlock situation.
>
> This so-called 'deadlock' only occurs if a remote TCP stack is unable
> to send win0 probes.
>
> In this case, sending some ACK will not help reliably if these ACK get lost.
>
> I find the description tries very hard to hide a bug in another stack,
> for some reason.
>
> When under memory stress, not sending an opening ACK as fast as possible,
> giving time for the host to recover from this memory stress was also a
> sensible thing to do.
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
>
> Thanks for the fix.

Yes, thanks for the fix. LGTM as well.

Reviewed-by: Neal Cardwell <ncardwell@google.com>

BTW, IMHO it would be nice to have some sort of NET_INC_STATS() of an
SNMP stat for this case, since we have SNMP stat increases for other
0-window cases. That could help debugging performance problems from
memory pressure and zero windows. But that can be in a separate patch
for net-next once this fix is in net-next.

neal
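
For anyone following the numbers in the quoted log, here is a small
user-space sketch of the arithmetic. It is not kernel code: the helper
names are hypothetical, and the two expressions are simplified from what
the log prints (win_now derived from rcv_wup, rcv_wnd and rcv_nxt, and
the "new_win >= 2 * win_now" test reported from __tcp_cleanup_rbuf()),
assuming 32-bit sequence arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the window the receiver believes is still open,
 * i.e. the last recorded right edge (rcv_wup + rcv_wnd) minus rcv_nxt,
 * clamped at zero. Unsigned wraparound of sequence numbers is intended. */
static uint32_t sketch_win_now(uint32_t rcv_wup, uint32_t rcv_wnd,
                               uint32_t rcv_nxt)
{
    int32_t win = (int32_t)(rcv_wup + rcv_wnd - rcv_nxt);

    return win < 0 ? 0 : (uint32_t)win;
}

/* Hypothetical helper: the comparison printed in the log. The widening
 * to 64 bits only avoids overflow when doubling win_now. */
static int sketch_time_to_ack(uint32_t new_win, uint32_t win_now)
{
    return (uint64_t)new_win >= 2 * (uint64_t)win_now;
}

/* Replays the values from the quoted log. */
void sketch_check_log_values(void)
{
    uint32_t win_now = sketch_win_now(265469200u, 262144u, 265600160u);

    assert(win_now == 131184u);                   /* matches "win_now 131184" */
    assert(!sketch_time_to_ack(0u, win_now));      /* first read, new_win = 0 */
    assert(!sketch_time_to_ack(262144u, win_now)); /* later reads */
}
```

With the logged values, win_now comes out as 131184, so 2 * win_now is
262368, which is just above new_win = 262144. The comparison therefore
stays false on every read, consistent with the repeated "NOT calling
tcp_send_ack()" lines, because rcv_wup and rcv_wnd were not updated when
the zero window was advertised.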