public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Menglong Dong <menglong8.dong@gmail.com>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: Jason Xing <kerneljasonxing@gmail.com>,
	Jon Maloy <jmaloy@redhat.com>, Eric Dumazet <edumazet@google.com>,
	Neal Cardwell <ncardwell@google.com>,
	netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
	passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com,
	eric.dumazet@gmail.com
Subject: Re: [net,v2] tcp: correct handling of extreme memory squeeze
Date: Mon, 27 Jan 2025 21:37:23 +0800	[thread overview]
Message-ID: <CADxym3Zji3NZy2tBAxSm5GaQ8tVG8PmxcyJ_AGnUC-H386tq7g@mail.gmail.com> (raw)
In-Reply-To: <20250127113214.294bcafb@elisabeth>

On Mon, Jan 27, 2025 at 6:32 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Mon, 27 Jan 2025 18:17:28 +0800
> Jason Xing <kerneljasonxing@gmail.com> wrote:
>
> > I'm not that sure if it's a bug belonging to the Linux kernel.
>
> It is, because for at least 20-25 years (before that it's a bit hard to
> understand from history) a non-zero window would be announced, as
> obviously expected, once there's again space in the receive window.

Sorry for the late reply. I think the key of this problem is
what should we do when we receive a tcp packet and we are
out of memory.

The RFC doesn't define such a thing, so in the commit
e2142825c120 ("net: tcp: send zero-window ACK when no memory"),
I reply with a zero-window ACK to the peer. And the peer will keep
probing the window by retransmitting the packet that we dropped if
the peer is a LINUX SYSTEM.

As I said, the RFC doesn't define such a case, so the behavior of
the peer is undefined if it is not a LINUX SYSTEM. If the peer doesn't
keep retransmitting the packet, it will hang the connection, just like
the problem that described in this commit log.

However, we can make some optimization to make it more
adaptable. We can send a ACK with the right window to the
peer when the memory is available, and __tcp_cleanup_rbuf()
is a good choice.

Generally speaking, I think this patch makes sense. However,
I'm not sure if there is any other influence if we make
"tp->rcv_wnd=0", but it can trigger a ACK in __tcp_cleanup_rbuf().

Following is the code that I thought before to optimize this
case (the code is totally not tested):

diff --git a/include/net/inet_connection_sock.h
b/include/net/inet_connection_sock.h
index 3c82fad904d4..bedd78946762 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -116,7 +116,8 @@ struct inet_connection_sock {
         #define ATO_BITS 8
         __u32          ato:ATO_BITS,     /* Predicted tick of soft
clock       */
                   lrcv_flowlabel:20, /* last received ipv6 flowlabel       */
-                  unused:4;
+                  is_oom:1,
+                  unused:3;
         unsigned long      timeout;     /* Currently scheduled
timeout           */
         __u32          lrcvtime;     /* timestamp of last received
data packet */
         __u16          last_seg_size; /* Size of last incoming segment       */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..6f3c85a1f4da 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1458,11 +1458,11 @@ static int tcp_peek_sndq(struct sock *sk,
struct msghdr *msg, int len)
  */
 void __tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
+    struct inet_connection_sock *icsk = inet_csk(sk);
     struct tcp_sock *tp = tcp_sk(sk);
     bool time_to_ack = false;

     if (inet_csk_ack_scheduled(sk)) {
-        const struct inet_connection_sock *icsk = inet_csk(sk);

         if (/* Once-per-two-segments ACK was not sent by tcp_input.c */
             tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
@@ -1502,6 +1502,11 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied)
                 time_to_ack = true;
         }
     }
+    if (unlikely(icsk->icsk_ack.is_oom)) {
+        icsk->icsk_ack.is_oom = false;
+        time_to_ack = true;
+    }
+
     if (time_to_ack)
         tcp_send_ack(sk);
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..e2d65213b3b7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -268,9 +268,12 @@ static u16 tcp_select_window(struct sock *sk)
      * are out of memory. The window is temporary, so we don't store
      * it on the socket.
      */
-    if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+    if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+        inet_csk(sk)->icsk_ack.is_oom = true;
         return 0;
+    }

+    inet_csk(sk)->icsk_ack.is_oom = false;
     cur_win = tcp_receive_window(tp);
     new_win = __tcp_select_window(sk);
     if (new_win < cur_win) {

>
> > The other side not sending a window probe causes this issue...?
>
> It doesn't cause this issue, but it triggers it.
>
> > The other part of me says we cannot break the user's behaviour.
>
> This sounds quite relevant, yes.
>
> --
> Stefano
>

  reply	other threads:[~2025-01-27 13:38 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-17 21:40 [net,v2] tcp: correct handling of extreme memory squeeze jmaloy
2025-01-17 22:09 ` Eric Dumazet
2025-01-17 22:27   ` Stefano Brivio
2025-01-18 17:01 ` Jason Xing
2025-01-18 20:04 ` Neal Cardwell
2025-01-20  5:03   ` Jon Maloy
2025-01-20 16:10     ` Jon Maloy
2025-01-20 16:22       ` Eric Dumazet
2025-01-24 17:40         ` Jon Maloy
2025-01-27  9:53           ` Eric Dumazet
2025-01-27 10:01           ` Stefano Brivio
2025-01-27 10:06             ` Eric Dumazet
2025-01-27 10:27               ` Stefano Brivio
2025-01-27 10:17             ` Jason Xing
2025-01-27 10:32               ` Stefano Brivio
2025-01-27 13:37                 ` Menglong Dong [this message]
2025-01-27 14:03                   ` Stefano Brivio
2025-01-27 16:37           ` Eric Dumazet
  -- strict thread matches above, loose matches on Subject: below --
2025-01-16  2:29 [net, v2] " Jon Maloy
2025-01-16 21:14 ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CADxym3Zji3NZy2tBAxSm5GaQ8tVG8PmxcyJ_AGnUC-H386tq7g@mail.gmail.com \
    --to=menglong8.dong@gmail.com \
    --cc=davem@davemloft.net \
    --cc=dgibson@redhat.com \
    --cc=edumazet@google.com \
    --cc=eric.dumazet@gmail.com \
    --cc=jmaloy@redhat.com \
    --cc=kerneljasonxing@gmail.com \
    --cc=kuba@kernel.org \
    --cc=lvivier@redhat.com \
    --cc=ncardwell@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).