From: Eric Dumazet <edumazet@google.com>
Date: Mon, 20 Jan 2025 17:22:52 +0100
Subject: Re: [net,v2] tcp: correct handling of extreme memory squeeze
To: Jon Maloy
Cc: Neal Cardwell, netdev@vger.kernel.org, davem@davemloft.net,
 kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com,
 lvivier@redhat.com, dgibson@redhat.com, eric.dumazet@gmail.com,
 Menglong Dong
References: <20250117214035.2414668-1-jmaloy@redhat.com>
List-Id: Development discussion and patches for passt

On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy wrote:
>
>
> On 2025-01-20 00:03, Jon Maloy wrote:
>
> > On 2025-01-18 15:04, Neal Cardwell wrote:
> >> On Fri, Jan 17, 2025 at 4:41 PM wrote:
> >>>
> >>> From: Jon Maloy
> >>>
> >>> Testing with iperf3 using the "pasta" protocol splicer has revealed
> >>> a bug in the way tcp handles window advertising in extreme memory
> >>> squeeze situations.
> >>>
> >>> Under memory pressure, a socket endpoint may temporarily advertise
> >>> a zero-sized window, but this is not stored as part of the socket data.
> >>> The reasoning behind this is that it is considered a temporary setting
> >>> which shouldn't influence any further calculations.
> >>>
> >>> However, if we happen to stall at an unfortunate value of the current
> >>> window size, the algorithm selecting a new value will consistently fail
> >>> to advertise a non-zero window once we have freed up enough memory.
> >>
> >> The "if we happen to stall at an unfortunate value of the current
> >> window size" phrase is a little vague... :-) Do you have a sense of
> >> what might count as "unfortunate" here? That might help in crafting a
> >> packetdrill test to reproduce this and have an automated regression
> >> test.
> >
> > Obviously, it happens when the following code snippet in
> >
> > __tcp_cleanup_rbuf() {
> >         [....]
> >         if (copied > 0 && !time_to_ack &&
> >             !(sk->sk_shutdown & RCV_SHUTDOWN)) {
> >                 __u32 rcv_window_now = tcp_receive_window(tp);
> >
> >                 /* Optimize, __tcp_select_window() is not cheap. */
> >                 if (2 * rcv_window_now <= tp->window_clamp) {
> >                         __u32 new_window = __tcp_select_window(sk);
> >
> >                         /* Send ACK now, if this read freed lots of space
> >                          * in our buffer. Certainly, new_window is new window.
> >                          * We can advertise it now, if it is not less than
> >                          * current one.
> >                          * "Lots" means "at least twice" here.
> >                          */
> >                         if (new_window && new_window >= 2 * rcv_window_now)
> >                                 time_to_ack = true;
> >                 }
> >         }
> >         [....]
> > }
> >
> > yields time_to_ack = false, i.e.
> > __tcp_select_window(sk) returns
> > a value new_window < (2 * tcp_receive_window(tp)).
> >
> > In my log I have for brevity used the following names:
> >
> > win_now: same as rcv_window_now
> >          (= tcp_receive_window(tp)
> >           = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt
> >           = 265469200 + 262144 - 265600160
> >           = 131184)
> >
> > new_win: same as new_window
> >          (= __tcp_select_window(sk)
> >           = 0 first time, later 262144)
> >
> > rcv_wnd: same as tp->rcv_wnd
> >          (= 262144)
> >
> > We see that although the last test actually is pretty close
> > (262144 >= 262368 ? => false) it is not close enough.
> >
> > We also notice that
> > (tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
> > 130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
> > (new_window >= 2 * rcv_window_now) will always be false.
> >
> > To me it looks like __tcp_select_window(sk) doesn't take the
> > freed-up memory into account at all when calculating a new window.
> > I haven't looked into why that is happening.
> >
> >>> This means that this side's notion of the current window size is
> >>> different from the one last advertised to the peer, causing the latter
> >>> to not send any data to resolve the situation.
> >>
> >> Since the peer last saw a zero receive window at the time of the
> >> memory-pressure drop, shouldn't the peer be sending repeated zero
> >> window probes, and shouldn't the local host respond to a ZWP with an
> >> ACK with the correct non-zero window?
> >
> > It should, but at the moment when I found this bug the peer stack was
> > not the Linux kernel stack, but one we develop for our own purposes. We
> > fixed that later, but it still means that traffic stops for a couple of
> > seconds now and then before the timer restarts the flow. This happens
> > too often for comfort in our usage scenarios.
> > We can of course blame the peer stack, but I still feel this is a
> > bug, and that it could be handled better by the kernel stack.
> >
> >> Do you happen to have a tcpdump .pcap of one of these cases that you
> >> can share?
> >
> > I had one, although not for this particular run, and I cannot find it
> > right now. I will continue looking or make a new one. Is there some
> > shared space I can put it?
>
> Here it is. Look at frame #1067.
> https://passt.top/static/iperf3_jon_zero_window_cut.pcap
>
> >>> The problem occurs on the iperf3 server side, and the socket in question
> >>> is a completely regular socket with the default settings for the
> >>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
> >>>
> >>> The following excerpt of a logging session, with own comments added,
> >>> shows in more detail what is happening:
> >>>
> >>> // tcp_v4_rcv(->)
> >>> //   tcp_rcv_established(->)
> >>> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> >>> [5201<->39222]: tcp_data_queue(->)
> >>> [5201<->39222]: DROPPING skb [265600160..265665640], reason:
> >>>                 SKB_DROP_REASON_PROTO_MEM
> >>>                 [rcv_nxt 265600160, rcv_wnd 262144, snt_ack
> >>>                 265469200, win_now 131184]
> >>
> >> What is "win_now"? That doesn't seem to correspond to any variable
> >> name in the Linux source tree.
> >
> > See above.
> >
> >> Can this be renamed to the
> >> tcp_select_window() variable it is printing, like "cur_win" or
> >> "effective_win" or "new_win", etc?
> >>
> >> Or perhaps you can attach your debugging patch in some email thread? I
> >> agree with Eric that these debug dumps are a little hard to parse
> >> without seeing the patch that allows us to understand what some of
> >> these fields are...
> >>
> >> I agree with Eric that probably tp->pred_flags should be cleared, and
> >> a packetdrill test for this would be super-helpful.
> >
> > I must admit I have never used packetdrill, but I can make an effort.
>
> I hear from other sources that you cannot force memory exhaustion with
> packetdrill anyway, so this sounds like a pointless exercise.

We certainly can and should add a feature like that to packetdrill.
Documentation/fault-injection/ has some relevant information.

Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf,
which could be lowered by a packetdrill script, I think.