From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by passt.top (Postfix) with ESMTP id 3E9BA5A026F
	for ; Thu, 4 Apr 2024 00:58:38 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1712185117;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=fLO2vFHzm0QUrgEbUbclMCirigxMS4qqEcqa9++XoR8=;
	b=S//n0Q1VXIdrh3/nISmGIOR1qXlDnJi/3IWzzvq1a115yr3rHFwhzFkoeD0nj6kX7rKGnt
	 ew0lcYXzbf1rx2h7QKKvpqZXsAp2wpudzPcj2Xd4ac0+gtxHMncsEB2F0osAErqHx1GFFl
	 X8VUXqU/mpFJCNOK6MKi0n4Thqz+VjU=
Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88])
	by relay.mimecast.com with ESMTP with STARTTLS
	(version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384)
	id us-mta-163-9P7DsuioMzS7hAatIau9jA-1; Wed, 03 Apr 2024 18:58:35 -0400
X-MC-Unique: 9P7DsuioMzS7hAatIau9jA-1
Received: from smtp.corp.redhat.com (int-mx10.intmail.prod.int.rdu2.redhat.com [10.11.54.10])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
	(No client certificate requested)
	by mimecast-mx02.redhat.com (Postfix) with ESMTPS id A1ED288CE02
	for ; Wed, 3 Apr 2024 22:58:35 +0000 (UTC)
Received: from fenrir.redhat.com (unknown [10.22.10.69])
	by smtp.corp.redhat.com (Postfix) with ESMTP id 41BD0492BCD;
	Wed, 3 Apr 2024 22:58:35 +0000 (UTC)
From: Jon Maloy
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
	dgibson@redhat.com, jmaloy@redhat.com
Subject: [net-next 2/2] tcp: correct handling of extreme memory squeeze
Date: Wed, 3 Apr 2024 18:58:33 -0400
Message-ID: <20240403225833.123346-3-jmaloy@redhat.com>
In-Reply-To: <20240403225833.123346-1-jmaloy@redhat.com>
References:
 <20240403225833.123346-1-jmaloy@redhat.com>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.10
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="US-ASCII"; x-default=true
Message-ID-Hash: NZZZITQJTICWYQKR3BR7JLCFVCZNUE6T
X-Message-ID-Hash: NZZZITQJTICWYQKR3BR7JLCFVCZNUE6T
X-MailFrom: jmaloy@redhat.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency;
	loop; banned-address; member-moderation; nonmember-moderation;
	administrivia; implicit-dest; max-recipients; max-size;
	news-moderation; no-subject; digests; suspicious-header
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt

Testing the previous commit in this series ("tcp: add support for
SO_PEEK_OFF") together with the pasta protocol splicer revealed a bug
in the way TCP handles window advertising in extreme memory squeeze
situations.

The following excerpt from a logging session shows what is happening:

[5201<->54494]: ==== Activating log @ tcp_select_window()/268 ====
[5201<->54494]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->54494]: tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
[5201<->54494]: ADVERTISING WINDOW SIZE 0
[5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
[...]
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0

We can see that although we are advertising a window size of zero,
tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
between this side's and the peer's view of the current window size.

- The peer thinks the window is zero, and stops sending.
- This side ends up in a cycle where it repeatedly calculates a new
  window size it finds too small to advertise.

Hence no messages are received and no acknowledgments are sent, and
the situation remains locked even after the last queued receive buffer
has been consumed.

We fix this by setting tp->rcv_wnd to 0 and tp->rcv_wup to tp->rcv_nxt
before we return from the function tcp_select_window() in this
particular case.

Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Signed-off-by: Jon Maloy
---
 net/ipv4/tcp_output.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..5803fd402708 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -264,8 +264,11 @@ static u16 tcp_select_window(struct sock *sk)
 	 * are out of memory. The window is temporary, so we don't store
 	 * it on the socket.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
2.42.0
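
For reference, the arithmetic behind the win_now and time_to_ack values
in the log can be replayed in user space. The following is only a
minimal sketch, not kernel code: struct rcv_state and the helpers
receive_window(), time_to_ack(), advertise_zero_window() and
replay_log() are hypothetical names made up for this illustration. It
assumes that win_now is computed the way tcp_receive_window() does it
(rcv_wup + rcv_wnd - rcv_nxt, clamped at zero), it reduces the check
logged in __tcp_cleanup_rbuf() to the "at least twice the current
window" comparison, leaving out the window_clamp test, and it assumes
that new_win stays at the logged 262144 after the fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three struct tcp_sock fields in the log. */
struct rcv_state {
	uint32_t rcv_wup;	/* rcv_nxt at the last window update sent */
	uint32_t rcv_wnd;	/* window we believe we last advertised */
	uint32_t rcv_nxt;	/* next sequence number we expect */
};

/* Assumed to mirror tcp_receive_window(): the unused part of the last
 * advertised window, clamped at zero.
 */
static uint32_t receive_window(const struct rcv_state *s)
{
	int32_t win = (int32_t)(s->rcv_wup + s->rcv_wnd - s->rcv_nxt);

	return win < 0 ? 0 : (uint32_t)win;
}

/* Simplified form of the logged check: send an ACK only if the new
 * window is nonzero and at least twice the current one.
 */
static bool time_to_ack(uint32_t win_now, uint32_t new_win)
{
	return new_win != 0 && new_win >= 2 * win_now;
}

/* The two assignments added by the patch. */
static void advertise_zero_window(struct rcv_state *s)
{
	s->rcv_wnd = 0;
	s->rcv_wup = s->rcv_nxt;
}

/* Replays the values from the log, without and with the fix. */
static void replay_log(void)
{
	struct rcv_state s = { 2812454294u, 5812224u, 2818016354u };
	const uint32_t new_win = 262144;	/* assumed unchanged by the fix */

	/* Stale rcv_wnd: win_now is 250164, and 262144 < 500328. */
	assert(receive_window(&s) == 250164);
	assert(!time_to_ack(receive_window(&s), new_win));

	/* With the fix, win_now matches the advertised zero window. */
	advertise_zero_window(&s);
	assert(receive_window(&s) == 0);
	assert(time_to_ack(receive_window(&s), new_win));
}
```

Calling replay_log() from a test program passes all four assertions:
with the stale tp->rcv_wnd the comparison can never succeed while
new_win stays below 500328, whereas after the two assignments any
nonzero new_win satisfies it, which is consistent with the recovery
described above.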