From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Dumazet
X-MailFrom: edumazet@google.com
Date: Tue, 28 Jan 2025 16:04:35 +0100
Subject: Re: [net,v3] tcp: correct handling of extreme memory squeeze
To: jmaloy@redhat.com
CC: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
 passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
 dgibson@redhat.com, memnglong8.dong@gmail.com, kerneljasonxing@gmail.com,
 ncardwell@google.com, eric.dumazet@gmail.com
References: <20250127231304.1465565-1-jmaloy@redhat.com>
In-Reply-To: <20250127231304.1465565-1-jmaloy@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
List-Id: Development discussion and patches for passt

On Tue, Jan 28, 2025 at 12:13 AM wrote:
>
> From: Jon Maloy
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the situation.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> //              tcp_v4_rcv(->)
> //                tcp_rcv_established(->)
> [5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]:     tcp_data_queue(->)
> [5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>                        [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                        [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
>                        [OFO queue: gap: 65480, len: 0]
> [5201<->39222]:     tcp_data_queue(<-)
> [5201<->39222]:     __tcp_transmit_skb(->)
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:       tcp_select_window(->)
> [5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
>                         [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
>                         returning 0
> [5201<->39222]:       tcp_select_window(<-)
> [5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]:     [__tcp_transmit_skb(<-)
> [5201<->39222]:     tcp_rcv_established(<-)
> [5201<->39222]:     tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>                   returning 6104 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>                   returning 131072 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]:     NOT calling tcp_send_ack()
>                     [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
> [5201<->39222]:   __tcp_cleanup_rbuf(<-)
>                   [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>                   [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>                   returning 54672 bytes
> [5201<->39222]: tcp_recvmsg_locked(<-)
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.

This so-called 'deadlock' only occurs if a remote TCP stack is unable
to send win0 probes.

In this case, sending some ACK will not help reliably if these ACKs
get lost.

I find the description tries very hard to hide a bug in another stack,
for some reason.

When under memory stress, not sending an opening ACK as fast as possible,
giving time for the host to recover from this memory stress was also a
sensible thing to do.

Reviewed-by: Eric Dumazet

Thanks for the fix.
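
For readers following the numbers in the quoted log, here is a minimal sketch of the arithmetic it shows. It is not the kernel code: the struct and the helper names win_now() and time_to_ack() are hypothetical, chosen to match the labels printed in the log, and the check is simplified to the single comparison the log reports (new_win >= 2 * win_now, with a zero new_win never triggering an ACK).

```c
#include <stdint.h>

/* Hypothetical model of the three values printed in the log.
 * Field names mirror the log labels, not the kernel's struct layout. */
struct rcv_state {
	uint32_t rcv_wup;	/* rcv_nxt when the last window update was recorded */
	uint32_t rcv_wnd;	/* window recorded as last advertised */
	uint32_t rcv_nxt;	/* next sequence number expected */
};

/* Window still open according to the recorded state (never negative). */
static uint32_t win_now(const struct rcv_state *s)
{
	int32_t win = (int32_t)(s->rcv_wup + s->rcv_wnd - s->rcv_nxt);

	return win < 0 ? 0 : (uint32_t)win;
}

/* Simplified form of the check shown in the log: send a window update
 * only if the new window is non-zero and at least twice the current one. */
static int time_to_ack(const struct rcv_state *s, uint32_t new_win)
{
	return new_win != 0 && new_win >= 2 * win_now(s);
}
```

With the logged values (rcv_wup 265469200, rcv_wnd 262144, rcv_nxt 265600160), win_now() gives 265469200 + 262144 - 265600160 = 131184, so 2 * win_now is 262368. A new_win of 262144 falls 224 bytes short of that threshold, and because the recorded state was not changed when the zero window was sent, the same comparison fails on every subsequent read.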