From: Menglong Dong <menglong8.dong@gmail.com>
Date: Mon, 27 Jan 2025 21:37:23 +0800
Subject: Re: [net,v2] tcp: correct handling of extreme memory squeeze
To: Stefano Brivio
CC: Jason Xing, Jon Maloy, Eric Dumazet, Neal Cardwell,
 netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
 passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com,
 eric.dumazet@gmail.com
References: <20250117214035.2414668-1-jmaloy@redhat.com>
 <20250127110121.1f53b27d@elisabeth> <20250127113214.294bcafb@elisabeth>
In-Reply-To: <20250127113214.294bcafb@elisabeth>
List-Id: Development discussion and patches for passt

On Mon, Jan 27, 2025 at 6:32 PM Stefano Brivio wrote:
>
> On Mon, 27 Jan 2025 18:17:28 +0800
> Jason Xing wrote:
>
> > I'm not that sure if it's a bug belonging to the Linux kernel.
>
> It is, because for at least 20-25 years (before that it's a bit hard to
> understand from history) a non-zero window would be announced, as
> obviously expected, once there's again space in the receive window.

Sorry for the late reply. I think the key to this problem is what we
should do when we receive a TCP packet and we are out of memory.

The RFC doesn't define such a case, so in commit
e2142825c120 ("net: tcp: send zero-window ACK when no memory"),
I reply with a zero-window ACK to the peer. The peer will then keep
probing the window by retransmitting the packet that we dropped, if
the peer is a LINUX SYSTEM.

As I said, the RFC doesn't define such a case, so the behavior of the
peer is undefined if it is not a LINUX SYSTEM. If the peer doesn't keep
retransmitting the packet, the connection will hang, just like the
problem described in this commit log.

However, we can make some optimizations to make this more adaptable.
We can send an ACK with the right window to the peer once memory is
available, and __tcp_cleanup_rbuf() is a good place to do it.

Generally speaking, I think this patch makes sense. However, I'm not
sure whether setting "tp->rcv_wnd=0" has any other side effects, though
it can trigger an ACK in __tcp_cleanup_rbuf().
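The sequence described above (zero-window ACK when out of memory, then
a window update once the application has drained the receive queue) can
be illustrated with a rough user-space model. This is only a sketch
under stated assumptions: struct toy_sock and all toy_* functions are
hypothetical names made up for illustration, not kernel APIs, and the
memory accounting is reduced to a single byte counter.

```c
#include <stdbool.h>

/* Toy user-space model. Every name here is hypothetical, not a kernel API. */
struct toy_sock {
	unsigned int rcv_space;   /* free receive memory, in bytes */
	bool nomem;               /* a segment was dropped for lack of memory */
	bool is_oom;              /* last advertised window was zero due to nomem */
	unsigned int acks_sent;   /* number of ACKs sent so far */
	unsigned int last_window; /* window carried by the most recent ACK */
};

/* Pick the window to advertise: zero while out of memory. */
static unsigned int toy_select_window(struct toy_sock *sk)
{
	if (sk->nomem) {
		sk->is_oom = true;
		return 0;
	}
	sk->is_oom = false;
	return sk->rcv_space;
}

/* Send an ACK; the pending "no memory" condition is consumed by it. */
static void toy_send_ack(struct toy_sock *sk)
{
	sk->last_window = toy_select_window(sk);
	sk->nomem = false;
	sk->acks_sent++;
}

/* A segment arrives: queue it if it fits, otherwise drop it. */
static void toy_receive(struct toy_sock *sk, unsigned int len)
{
	if (len > sk->rcv_space)
		sk->nomem = true;
	else
		sk->rcv_space -= len;
	toy_send_ack(sk);
}

/* The application consumed `copied` bytes from the receive queue. */
static void toy_cleanup_rbuf(struct toy_sock *sk, unsigned int copied)
{
	bool time_to_ack = false;

	sk->rcv_space += copied;

	/* A zero window was advertised earlier: reopen it explicitly. */
	if (sk->is_oom) {
		sk->is_oom = false;
		time_to_ack = true;
	}

	if (time_to_ack)
		toy_send_ack(sk);
}
```

In this model, a segment that does not fit makes toy_receive() answer
with a zero window, and the next toy_cleanup_rbuf() call sends exactly
one extra ACK carrying the reopened window, so the peer does not have
to rely on its own window probes.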

Following is the code I had in mind earlier to optimize this case (the
code is completely untested):

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 3c82fad904d4..bedd78946762 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -116,7 +116,8 @@ struct inet_connection_sock {
 #define ATO_BITS 8
 		__u32		  ato:ATO_BITS,	 /* Predicted tick of soft clock	   */
 				  lrcv_flowlabel:20, /* last received ipv6 flowlabel	   */
-				  unused:4;
+				  is_oom:1,
+				  unused:3;
 		unsigned long	  timeout;	 /* Currently scheduled timeout		   */
 		__u32		  lrcvtime;	 /* timestamp of last received data packet */
 		__u16		  last_seg_size; /* Size of last incoming segment	   */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..6f3c85a1f4da 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1458,11 +1458,11 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
  */
 void __tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool time_to_ack = false;
 
 	if (inet_csk_ack_scheduled(sk)) {
-		const struct inet_connection_sock *icsk = inet_csk(sk);
 
 		if (/* Once-per-two-segments ACK was not sent by tcp_input.c */
 		    tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
@@ -1502,6 +1502,11 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
 			time_to_ack = true;
 		}
 	}
+	if (unlikely(icsk->icsk_ack.is_oom)) {
+		icsk->icsk_ack.is_oom = false;
+		time_to_ack = true;
+	}
+
 	if (time_to_ack)
 		tcp_send_ack(sk);
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..e2d65213b3b7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -268,9 +268,12 @@ static u16 tcp_select_window(struct sock *sk)
 	 * are out of memory. The window is temporary, so we don't store
 	 * it on the socket.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		inet_csk(sk)->icsk_ack.is_oom = true;
 		return 0;
+	}
 
+	inet_csk(sk)->icsk_ack.is_oom = false;
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
 	if (new_win < cur_win) {

>
> > The other side not sending a window probe causes this issue...?
>
> It doesn't cause this issue, but it triggers it.
>
> > The other part of me says we cannot break the user's behaviour.
>
> This sounds quite relevant, yes.
>
> --
> Stefano
>