From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: passt.top; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20230601 header.b=cwtSSic7; dkim-atps=neutral Received: from mail-io1-xd35.google.com (mail-io1-xd35.google.com [IPv6:2607:f8b0:4864:20::d35]) by passt.top (Postfix) with ESMTPS id BC4515A0274 for ; Sat, 18 Jan 2025 18:01:59 +0100 (CET) Received: by mail-io1-xd35.google.com with SMTP id ca18e2360f4ac-84a012f7232so136996539f.0 for ; Sat, 18 Jan 2025 09:01:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1737219718; x=1737824518; darn=passt.top; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=mipqSWqCQw7mZMk4ZQpoBkHYub16XF6bdcgmVQvyxtA=; b=cwtSSic7VpzBQLOmXJdo7PnyG9MycCF/vbBA1+K/nyFipj2QOorlMpryJWS/2YsYW7 xUa8ICJtVUpg5DtTaRTFjFX7wmdSzAIQrdKJsIuKDayRZMDx9SuppHRUKF5MipWxkp2T CtnhvYT0zQiFajoeGiueCPbq0X7m0VcWb1lnDQoa0ht2xcrcrNYp/mhDVHNQAIgnnAoy SkhbL5wEUXWYqscgJMCl2+jq16keCgyL+zj24/IQWaxkUs7vzbfGZx+Za9lm158Z/Efh ZGsEse07PfoYHZBS10l5C2kKhj1/HYzrtLQDq8EYoMvxhZpgJxe1M6LAhQ00P1jB0jGl z0Xg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737219718; x=1737824518; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=mipqSWqCQw7mZMk4ZQpoBkHYub16XF6bdcgmVQvyxtA=; b=Ztd28aNoY1jAY2nuzNnZhzDTVr1mIGZmXbTbE83Q/UsoSjR7cwxp4VufNZSI+2gDO+ CHUizO1/vxCkgVrz7KTNtwYweT2ls+vP5CAejHeGz5NDzsIeSzEGEpa5fFMMiIqph/ce Bar8xBmgQ9CPWsBwlHjcsCZdMYKBEZlUZ22NHMrwsoFbK3Ha0EWjdjR4dFiwrdqompeF rwyuWKopc/AxtldqYRraYpYdZtWssJKwJ+5x03ubvg2uTNA2d8pIvxqa/SAqnfTu7eoS 8aem6um370FN6HtKdV06B+mqVu0lE9zDkte8RWCCjZnb5MBA9NxsL4vM0iLrHMSlZl6Q FIvQ== X-Forwarded-Encrypted: i=1; AJvYcCX9FOEqyM4X5xoEnmzMr4EzwNWfpVxx3WPM0mLenGSSW+wiEDVB6H19LWOd5FvYJ8snLhE+RGHGzYg=@passt.top X-Gm-Message-State: AOJu0Yw+K1ic0oCl/qQcHvQvwh9hui/ltE+/7PvYIxNXWTCh4B0oQGqo z7lf1Q615Rq79sXnFJtxLi0yeNO3ZX2aZuuS83NmqefQDInb3MqOOY3jMvzC6Fv++7NBVOUxnO+ KB0de9Iz4j2IeVK+xEmytPHmkbh0= X-Gm-Gg: ASbGnct0n/AUG86qBvqcLKsG/WRwZ4pUIp2nOEMznCxgbYTZ+2/LHXK14J61NxA98yF Z8+jaLBNEPu6S22IBfzFjE5mWiBAkCmm1Dz98eYBz/caz05ZzV54= X-Google-Smtp-Source: AGHT+IF10JQwcOIvtPv3CY+5fwkcF932UvYFvzR6ETJE6IlLVPtI8GFJUiuRsMqWJj00qZsMlmmWIvRqzVEuw+yhw1c= X-Received: by 2002:a05:6e02:8e:b0:3ce:7b33:8c3b with SMTP id e9e14a558f8ab-3cf747e2027mr53153855ab.5.1737219718316; Sat, 18 Jan 2025 09:01:58 -0800 (PST) MIME-Version: 1.0 References: <20250117214035.2414668-1-jmaloy@redhat.com> In-Reply-To: <20250117214035.2414668-1-jmaloy@redhat.com> From: Jason Xing Date: Sun, 19 Jan 2025 01:01:22 +0800 X-Gm-Features: AbW1kvaM0FI6R_4QP_9t3-02uU6hJnG4ABj-eRb_TNp2n5h6Q2StvfHd4MEFDvk Message-ID: Subject: Re: [net,v2] tcp: correct handling of extreme memory squeeze To: jmaloy@redhat.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-MailFrom: kerneljasonxing@gmail.com X-Mailman-Rule-Hits: nonmember-moderation X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation Message-ID-Hash: APOBUQJIZ4DQVG3SB6NMGSHF5ZZVZPDW X-Message-ID-Hash: APOBUQJIZ4DQVG3SB6NMGSHF5ZZVZPDW X-Mailman-Approved-At: Sat, 18 Jan 2025 18:12:29 +0100 CC: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com, imagedong@tencent.com, eric.dumazet@gmail.com, edumazet@google.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Sat, Jan 18, 2025 at 5:41=E2=80=AFAM wrote: > > From: Jon Maloy > > Testing with iperf3 using the "pasta" protocol splicer has revealed > a bug in the way tcp handles window advertising in extreme memory > squeeze situations. > > Under memory pressure, a socket endpoint may temporarily advertise > a zero-sized window, but this is not stored as part of the socket data. > The reasoning behind this is that it is considered a temporary setting > which shouldn't influence any further calculations. > > However, if we happen to stall at an unfortunate value of the current > window size, the algorithm selecting a new value will consistently fail > to advertise a non-zero window once we have freed up enough memory. > This means that this side's notion of the current window size is > different from the one last advertised to the peer, causing the latter > to not send any data to resolve the sitution. > > The problem occurs on the iperf3 server side, and the socket in question > is a completely regular socket with the default settings for the > fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. > > The following excerpt of a logging session, with own comments added, > shows more in detail what is happening: > > // tcp_v4_rcv(->) > // tcp_rcv_established(->) > [5201<->39222]: =3D=3D=3D=3D Activating log @ net/ipv4/tcp_input.c/tc= p_data_queue()/5257 =3D=3D=3D=3D > [5201<->39222]: tcp_data_queue(->) > [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_D= ROP_REASON_PROTO_MEM > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469= 200, win_now 131184] > [copied_seq 259909392->260034360 (124968), unread = 5565800, qlen 85, ofoq 0] > [5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0 > [5201<->39222]: __tcp_transmit_skb(->) > [5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->r= cv_wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)= --> TRUE > [5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->r= cv_wnd: 262144, tp->rcv_nxt 265600160, returning 0 > [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160 > [5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rc= v_wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: tcp_rcv_established(<-) > [5201<->39222]: tcp_v4_rcv(<-) > > // Receive queue is at 85 buffers and we are out of memory. > // We drop the incoming buffer, although it is in sequence, and decide > // to send an advertisement with a window of zero. > // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means > // we unconditionally shrink the window. > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_= wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win =3D 0, win_now =3D 131184, 2 * win_now =3D 2= 62368] > [5201<->39222]: [new_win >=3D (2 * win_now) ? --> time_to_ack =3D 0] > [5201<->39222]: NOT calling tcp_send_ack() > [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_= wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes. > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, wi= n_now 131184] > [copied_seq 260040464->260040464 (0), unread 5559696, qle= n 85, ofoq 0] > > // After each read, the algorithm for calculating the new receive > // window in __tcp_cleanup_rbuf() finds it is too small to advertise > // or to update tp->rcv_wnd. > // Meanwhile, the peer thinks the window is zero, and will not send > // any more data to trigger an update from the interrupt mode side. > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_= wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win =3D 262144, win_now =3D 131184, 2 * win_now = =3D 262368] > [5201<->39222]: [new_win >=3D (2 * win_now) ? --> time_to_ack =3D 0] > [5201<->39222]: NOT calling tcp_send_ack() > [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_= wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes. > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, wi= n_now 131184] > [copied_seq 260099840->260171536 (71696), unread 5428624,= qlen 83, ofoq 0] > > // The above pattern repeats again and again, since nothing changes > // between the reads. > > [...] > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_= wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win =3D 262144, win_now =3D 131184, 2 * win_now = =3D 262368] > [5201<->39222]: [new_win >=3D (2 * win_now) ? --> time_to_ack =3D 0] > [5201<->39222]: NOT calling tcp_send_ack() > [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_= wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes. > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, wi= n_now 131184] > [copied_seq 265469200->265545488 (76288), unread 54672, q= len 1, ofoq 0] > > [5201<->39222]: tcp_recvmsg_locked(->) > [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_= wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: [new_win =3D 262144, win_now =3D 131184, 2 * win_now = =3D 262368] > [5201<->39222]: [new_win >=3D (2 * win_now) ? --> time_to_ack =3D 0] > [5201<->39222]: NOT calling tcp_send_ack() > [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_= wnd: 262144, tp->rcv_nxt 265600160 > [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes. > [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, wi= n_now 131184] > [copied_seq 265600160->265600160 (0), unread 0, qlen 0, o= foq 0] > > // The receive queue is empty, but no new advertisement has been sent. > // The peer still thinks the receive window is zero, and sends nothing. > // We have ended up in a deadlock situation. > > Furthermore, we have observed that in these situations this side may > send out an updated 'th->ack_seq=C2=B4 which is not stored in tp->rcv_wup > as it should be. Backing ack_seq seems to be harmless, but is of > course still wrong from a protocol viewpoint. > > We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet > has been dropped because of memory exhaustion and we have to advertize > a zero window. > > Further testing shows that the connection recovers neatly from the > squeeze situation, and traffic can continue indefinitely. Thanks for such a detailed explanation :) However, undeniably, I've read several times until I totally follow the issue you reported. I wonder if you can compact and simplify the commit message a little bit, say, remove the last two calltraces that can be replaced by a single sentence like you've already stated (which is "the queue is empty"). Let me rephrase the scenario to see if I'm understanding in the right way: 1. First time the receiver sends a skb with zero window because of shortage of memory. 2. Sometimes the user calls the tcp_recvmsg() to remove the data from kernel to user space, sending the ACK process should be triggered, especially when the queue is empty. 3. Then the client reading the zero window signal stops sending data. #1 fails to update the rcv_wup and rcv_wnd, so since then in the manner of __tcp_cleanup_rbuf(), the rcv_window_now variable is always the same in this case (please see the calculation in tcp_receive_window()). rcv_window_now, which I presume is the same as win_now in the above description, should vary when __tcp_cleanup_rbuf() is called. Because of that, #2 could never send an ACK. My small question is why is the new_window always 262144 no matter what amount of data the user receives? Is it because in __tcp_select_window() there is no wscale option? Thanks, Jason