From: Jon Maloy <jmaloy@redhat.com>
To: Neal Cardwell <ncardwell@google.com>
Cc: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
dgibson@redhat.com, imagedong@tencent.com,
eric.dumazet@gmail.com, edumazet@google.com
Subject: Re: [net,v2] tcp: correct handling of extreme memory squeeze
Date: Mon, 20 Jan 2025 00:03:08 -0500 [thread overview]
Message-ID: <afb9ff14-a2f1-4c5a-a920-bce0105a7d41@redhat.com> (raw)
In-Reply-To: <CADVnQymiwUG3uYBGMc1ZEV9vAUQzEOD4ymdN7Rcqi7yAK9ZB5A@mail.gmail.com>
On 2025-01-18 15:04, Neal Cardwell wrote:
> On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
>>
>> From: Jon Maloy <jmaloy@redhat.com>
>>
>> Testing with iperf3 using the "pasta" protocol splicer has revealed
>> a bug in the way tcp handles window advertising in extreme memory
>> squeeze situations.
>>
>> Under memory pressure, a socket endpoint may temporarily advertise
>> a zero-sized window, but this is not stored as part of the socket data.
>> The reasoning behind this is that it is considered a temporary setting
>> which shouldn't influence any further calculations.
>>
>> However, if we happen to stall at an unfortunate value of the current
>> window size, the algorithm selecting a new value will consistently fail
>> to advertise a non-zero window once we have freed up enough memory.
>
> The "if we happen to stall at an unfortunate value of the current
> window size" phrase is a little vague... :-) Do you have a sense of
> what might count as "unfortunate" here? That might help in crafting a
> packetdrill test to reproduce this and have an automated regression
> test.
Obviously, it happens when the following code snippet in
__tcp_cleanup_rbuf() {
[....]
if (copied > 0 && !time_to_ack &&
!(sk->sk_shutdown & RCV_SHUTDOWN)) {
__u32 rcv_window_now = tcp_receive_window(tp);
/* Optimize, __tcp_select_window() is not cheap. */
if (2*rcv_window_now <= tp->window_clamp) {
__u32 new_window = __tcp_select_window(sk);
/* Send ACK now, if this read freed lots of space
* in our buffer. Certainly, new_window is new window.
* We can advertise it now, if it is not less than
* current one.
* "Lots" means "at least twice" here.
*/
if (new_window && new_window >= 2 * rcv_window_now)
time_to_ack = true;
}
}
[....]
}
yields time_to_ack = false, i.e. __tcp_select_window(sk) returns
a value new_window < (2 * tcp_receive_window(tp)).
In my log I have for brevity used the following names:
win_now: same as rcv_window_now
(= tcp_receive_window(tp),
= tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
= 265469200 + 262144 - 265600160,
= 131184)
new_win: same as new_window
(= __tcp_select_window(sk),
= 0 first time, later 262144 )
rcv_wnd: same as tp->rcv_wnd,
(=262144)
We see that although the last test actually is pretty close
(262144 >= 262368 ? => false) it is not close enough.
We also notice that
(tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
(new_window >= 2 * rcv_window_now) will always be false.
Too me it looks like __tcp_select_window(sk) doesn't at all take the
freed-up memory into account when calculating a new window. I haven't
looked into why that is happening.
>
>> This means that this side's notion of the current window size is
>> different from the one last advertised to the peer, causing the latter
>> to not send any data to resolve the sitution.
>
> Since the peer last saw a zero receive window at the time of the
> memory-pressure drop, shouldn't the peer be sending repeated zero
> window probes, and shouldn't the local host respond to a ZWP with an
> ACK with the correct non-zero window?
It should, but at the moment when I found this bug the peer stack was
not the Linux kernel stack, but one we develop for our own purpose. We
fixed that later, but it still means that traffic stops for a couple of
seconds now and then before the timer restarts the flow. This happens
too often for comfort in our usage scenarios.
We can of course blame the the peer stack, but I still feel this is a
bug, and that it could be handled better by the kernel stack.
>
> Do you happen to have a tcpdump .pcap of one of these cases that you can share?
I had one, although not for this particular run, and I cannot find it
right now. I will continue looking or make a new one. Is there some
shared space I can put it?
>
>> The problem occurs on the iperf3 server side, and the socket in question
>> is a completely regular socket with the default settings for the
>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>>
>> The following excerpt of a logging session, with own comments added,
>> shows more in detail what is happening:
>>
>> // tcp_v4_rcv(->)
>> // tcp_rcv_established(->)
>> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
>> [5201<->39222]: tcp_data_queue(->)
>> [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>
> What is "win_now"? That doesn't seem to correspond to any variable
> name in the Linux source tree.
See above.
Can this be renamed to the
> tcp_select_window() variable it is printing, like "cur_win" or
> "effective_win" or "new_win", etc?
>
> Or perhaps you can attach your debugging patch in some email thread? I
> agree with Eric that these debug dumps are a little hard to parse
> without seeing the patch that allows us to understand what some of
> these fields are...
>
> I agree with Eric that probably tp->pred_flags should be cleared, and
> a packetdrill test for this would be super-helpful.
I must admit I have never used packetdrill, but I can make an effort.
///jon
>
> thanks,
> neal
>
next prev parent reply other threads:[~2025-01-20 5:03 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-17 21:40 [net,v2] tcp: correct handling of extreme memory squeeze jmaloy
2025-01-17 22:09 ` Eric Dumazet
2025-01-17 22:27 ` Stefano Brivio
2025-01-18 17:01 ` Jason Xing
2025-01-18 20:04 ` Neal Cardwell
2025-01-20 5:03 ` Jon Maloy [this message]
2025-01-20 16:10 ` Jon Maloy
2025-01-20 16:22 ` Eric Dumazet
-- strict thread matches above, loose matches on Subject: below --
2025-01-16 2:29 [net, v2] " Jon Maloy
2025-01-16 21:14 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=afb9ff14-a2f1-4c5a-a920-bce0105a7d41@redhat.com \
--to=jmaloy@redhat.com \
--cc=davem@davemloft.net \
--cc=dgibson@redhat.com \
--cc=edumazet@google.com \
--cc=eric.dumazet@gmail.com \
--cc=imagedong@tencent.com \
--cc=kuba@kernel.org \
--cc=lvivier@redhat.com \
--cc=ncardwell@google.com \
--cc=netdev@vger.kernel.org \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).