* [net, v2] tcp: correct handling of extreme memory squeeze
@ 2025-01-16 2:29 Jon Maloy
2025-01-16 21:14 ` Stefano Brivio
0 siblings, 1 reply; 10+ messages in thread
From: Jon Maloy @ 2025-01-16 2:29 UTC (permalink / raw)
To: passt-dev, sbrivio, lvivier, dgibson, jmaloy
Testing with iperf3 using the "pasta" protocol splicer has revealed
a bug in the way tcp handles window advertising in extreme memory
squeeze situations. The problem occurs on the server side, and
the socket in question is a completely regular socket with the
default settings for the fedora40 kernel. We do not use SO_PEEK
or SO_RCV_BUF on this socket.
The following excerpt of a logging session, with own comments added,
shows what is happening:
// tcp_v4_rcv(<-)
// tcp_rcv_established(<-)
[5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]: tcp_data_queue(->)
[5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
[copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
[5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
[5201<->39222]: __tcp_transmit_skb(->)
[5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
[5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)
// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
[copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
[copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
// The above pattern repeats again and again, since nothing changes
// between the reads.
[...]
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
[copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200 (0), win_now 131184]
[copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
// The receive queue is empty, but no new advertisement is sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.
Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.
We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
has been dropped because of memory exhaustion and we have to advertize
a zero window.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.
Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v1: -Posted on Apr 6, 2024
v2: -Improved commit log to clarify how we end up in this situation.
-After feedback from Eric Dumazet, removed references to use of
SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
about how this situation occurs. Those flags are used at the
peer side's incoming connection, and not on this one.
---
net/ipv4/tcp_output.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..ba295f798e5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
u32 cur_win, new_win;
/* Make the window 0 if we failed to queue the data because we
- * are out of memory. The window is temporary, so we don't store
- * it on the socket.
+ * are out of memory.
*/
- if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+ if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+ tp->rcv_wnd = 0;
+ tp->rcv_wup = tp->rcv_nxt;
return 0;
+ }
cur_win = tcp_receive_window(tp);
new_win = __tcp_select_window(sk);
--
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
u32 cur_win, new_win;
/* Make the window 0 if we failed to queue the data because we
- * are out of memory. The window is temporary, so we don't store
- * it on the socket.
+ * are out of memory.
*/
- if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+ if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+ tp->rcv_wnd = 0;
+ tp->rcv_wup = tp->rcv_nxt;
return 0;
+ }
cur_win = tcp_receive_window(tp);
new_win = __tcp_select_window(sk);
--
2.48.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [net, v2] tcp: correct handling of extreme memory squeeze
2025-01-16 2:29 [net, v2] tcp: correct handling of extreme memory squeeze Jon Maloy
@ 2025-01-16 21:14 ` Stefano Brivio
0 siblings, 0 replies; 10+ messages in thread
From: Stefano Brivio @ 2025-01-16 21:14 UTC (permalink / raw)
To: Jon Maloy; +Cc: passt-dev, lvivier, dgibson
On Wed, 15 Jan 2025 21:29:18 -0500
Jon Maloy <jmaloy@redhat.com> wrote:
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations. The problem occurs on the server side, and
> the socket in question is a completely regular socket with the
> default settings for the fedora40 kernel. We do not use SO_PEEK
> or SO_RCV_BUF on this socket.
SO_RCVBUF
I think what's missing, at this point, is a short description of the
issue.
That is (if I recall correctly), under memory pressure, we temporarily
advertise a zero-sized window, but this is not stored as part of socket
data, the reasoning being that it's a temporary setting which shouldn't
influence any further calculation.
However, by not storing this information in socket data, we also fail
to advertise a non-zero window once we have enough memory to receive
data.
Our notion of the window size is now different from what we
advertised to the peer. The peer won't send any data as a result, but
we won't recalculate the window, either, as long as we don't receive
any.
The rest looks good and very reasonable to me.
--
Stefano
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [net,v2] tcp: correct handling of extreme memory squeeze
2025-01-20 16:10 ` Jon Maloy
@ 2025-01-20 16:22 ` Eric Dumazet
0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2025-01-20 16:22 UTC (permalink / raw)
To: Jon Maloy
Cc: Neal Cardwell, netdev, davem, kuba, passt-dev, sbrivio, lvivier,
dgibson, eric.dumazet, Menglong Dong
On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy@redhat.com> wrote:
>
>
>
> On 2025-01-20 00:03, Jon Maloy wrote:
> >
> >
> > On 2025-01-18 15:04, Neal Cardwell wrote:
> >> On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
> >>>
> >>> From: Jon Maloy <jmaloy@redhat.com>
> >>>
> >>> Testing with iperf3 using the "pasta" protocol splicer has revealed
> >>> a bug in the way tcp handles window advertising in extreme memory
> >>> squeeze situations.
> >>>
> >>> Under memory pressure, a socket endpoint may temporarily advertise
> >>> a zero-sized window, but this is not stored as part of the socket data.
> >>> The reasoning behind this is that it is considered a temporary setting
> >>> which shouldn't influence any further calculations.
> >>>
> >>> However, if we happen to stall at an unfortunate value of the current
> >>> window size, the algorithm selecting a new value will consistently fail
> >>> to advertise a non-zero window once we have freed up enough memory.
> >>
> >> The "if we happen to stall at an unfortunate value of the current
> >> window size" phrase is a little vague... :-) Do you have a sense of
> >> what might count as "unfortunate" here? That might help in crafting a
> >> packetdrill test to reproduce this and have an automated regression
> >> test.
> >
> > Obviously, it happens when the following code snippet in
> >
> > __tcp_cleanup_rbuf() {
> > [....]
> > if (copied > 0 && !time_to_ack &&
> > !(sk->sk_shutdown & RCV_SHUTDOWN)) {
> > __u32 rcv_window_now = tcp_receive_window(tp);
> >
> > /* Optimize, __tcp_select_window() is not cheap. */
> > if (2*rcv_window_now <= tp->window_clamp) {
> > __u32 new_window = __tcp_select_window(sk);
> >
> > /* Send ACK now, if this read freed lots of space
> > * in our buffer. Certainly, new_window is new window.
> > * We can advertise it now, if it is not less than
> > * current one.
> > * "Lots" means "at least twice" here.
> > */
> > if (new_window && new_window >= 2 * rcv_window_now)
> > time_to_ack = true;
> > }
> > }
> > [....]
> > }
> >
> > yields time_to_ack = false, i.e. __tcp_select_window(sk) returns
> > a value new_window < (2 * tcp_receive_window(tp)).
> >
> > In my log I have for brevity used the following names:
> >
> > win_now: same as rcv_window_now
> > (= tcp_receive_window(tp),
> > = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
> > = 265469200 + 262144 - 265600160,
> > = 131184)
> >
> > new_win: same as new_window
> > (= __tcp_select_window(sk),
> > = 0 first time, later 262144 )
> >
> > rcv_wnd: same as tp->rcv_wnd,
> > (=262144)
> >
> > We see that although the last test actually is pretty close
> > (262144 >= 262368 ? => false) it is not close enough.
> >
> >
> > We also notice that
> > (tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
> > 130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
> > (new_window >= 2 * rcv_window_now) will always be false.
> >
> >
> > Too me it looks like __tcp_select_window(sk) doesn't at all take the
> > freed-up memory into account when calculating a new window. I haven't
> > looked into why that is happening.
> >
> >>
> >>> This means that this side's notion of the current window size is
> >>> different from the one last advertised to the peer, causing the latter
> >>> to not send any data to resolve the sitution.
> >>
> >> Since the peer last saw a zero receive window at the time of the
> >> memory-pressure drop, shouldn't the peer be sending repeated zero
> >> window probes, and shouldn't the local host respond to a ZWP with an
> >> ACK with the correct non-zero window?
> >
> > It should, but at the moment when I found this bug the peer stack was
> > not the Linux kernel stack, but one we develop for our own purpose. We
> > fixed that later, but it still means that traffic stops for a couple of
> > seconds now and then before the timer restarts the flow. This happens
> > too often for comfort in our usage scenarios.
> > We can of course blame the the peer stack, but I still feel this is a
> > bug, and that it could be handled better by the kernel stack.
> >>
> >> Do you happen to have a tcpdump .pcap of one of these cases that you
> >> can share?
> >
> > I had one, although not for this particular run, and I cannot find it
> > right now. I will continue looking or make a new one. Is there some
> > shared space I can put it?
>
> Here it is. Look at frame #1067.
>
> https://passt.top/static/iperf3_jon_zero_window_cut.pcap
> >
> >>
> >>> The problem occurs on the iperf3 server side, and the socket in question
> >>> is a completely regular socket with the default settings for the
> >>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
> >>>
> >>> The following excerpt of a logging session, with own comments added,
> >>> shows more in detail what is happening:
> >>>
> >>> // tcp_v4_rcv(->)
> >>> // tcp_rcv_established(->)
> >>> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/
> >>> tcp_data_queue()/5257 ====
> >>> [5201<->39222]: tcp_data_queue(->)
> >>> [5201<->39222]: DROPPING skb [265600160..265665640], reason:
> >>> SKB_DROP_REASON_PROTO_MEM
> >>> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack
> >>> 265469200, win_now 131184]
> >>
> >> What is "win_now"? That doesn't seem to correspond to any variable
> >> name in the Linux source tree.
> >
> > See above.
> >
> > Can this be renamed to the
> >> tcp_select_window() variable it is printing, like "cur_win" or
> >> "effective_win" or "new_win", etc?
> >>
> >> Or perhaps you can attach your debugging patch in some email thread? I
> >> agree with Eric that these debug dumps are a little hard to parse
> >> without seeing the patch that allows us to understand what some of
> >> these fields are...
> >>
> >> I agree with Eric that probably tp->pred_flags should be cleared, and
> >> a packetdrill test for this would be super-helpful.
> >
> > I must admit I have never used packetdrill, but I can make an effort.
>
> I hear from other sources that you cannot force a memory exhaustion with
> packetdrill anyway, so this sounds like a pointless exercise.
We certainly can and should add a feature like that to packetdrill.
Documentation/fault-injection/ has some relevant information.
Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf
that could be lowered by a packetdrill script I think.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [net,v2] tcp: correct handling of extreme memory squeeze
2025-01-20 5:03 ` Jon Maloy
@ 2025-01-20 16:10 ` Jon Maloy
2025-01-20 16:22 ` Eric Dumazet
0 siblings, 1 reply; 10+ messages in thread
From: Jon Maloy @ 2025-01-20 16:10 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
eric.dumazet, edumazet, Menglong Dong
On 2025-01-20 00:03, Jon Maloy wrote:
>
>
> On 2025-01-18 15:04, Neal Cardwell wrote:
>> On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
>>>
>>> From: Jon Maloy <jmaloy@redhat.com>
>>>
>>> Testing with iperf3 using the "pasta" protocol splicer has revealed
>>> a bug in the way tcp handles window advertising in extreme memory
>>> squeeze situations.
>>>
>>> Under memory pressure, a socket endpoint may temporarily advertise
>>> a zero-sized window, but this is not stored as part of the socket data.
>>> The reasoning behind this is that it is considered a temporary setting
>>> which shouldn't influence any further calculations.
>>>
>>> However, if we happen to stall at an unfortunate value of the current
>>> window size, the algorithm selecting a new value will consistently fail
>>> to advertise a non-zero window once we have freed up enough memory.
>>
>> The "if we happen to stall at an unfortunate value of the current
>> window size" phrase is a little vague... :-) Do you have a sense of
>> what might count as "unfortunate" here? That might help in crafting a
>> packetdrill test to reproduce this and have an automated regression
>> test.
>
> Obviously, it happens when the following code snippet in
>
> __tcp_cleanup_rbuf() {
> [....]
> if (copied > 0 && !time_to_ack &&
> !(sk->sk_shutdown & RCV_SHUTDOWN)) {
> __u32 rcv_window_now = tcp_receive_window(tp);
>
> /* Optimize, __tcp_select_window() is not cheap. */
> if (2*rcv_window_now <= tp->window_clamp) {
> __u32 new_window = __tcp_select_window(sk);
>
> /* Send ACK now, if this read freed lots of space
> * in our buffer. Certainly, new_window is new window.
> * We can advertise it now, if it is not less than
> * current one.
> * "Lots" means "at least twice" here.
> */
> if (new_window && new_window >= 2 * rcv_window_now)
> time_to_ack = true;
> }
> }
> [....]
> }
>
> yields time_to_ack = false, i.e. __tcp_select_window(sk) returns
> a value new_window < (2 * tcp_receive_window(tp)).
>
> In my log I have for brevity used the following names:
>
> win_now: same as rcv_window_now
> (= tcp_receive_window(tp),
> = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
> = 265469200 + 262144 - 265600160,
> = 131184)
>
> new_win: same as new_window
> (= __tcp_select_window(sk),
> = 0 first time, later 262144 )
>
> rcv_wnd: same as tp->rcv_wnd,
> (=262144)
>
> We see that although the last test actually is pretty close
> (262144 >= 262368 ? => false) it is not close enough.
>
>
> We also notice that
> (tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
> 130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
> (new_window >= 2 * rcv_window_now) will always be false.
>
>
> Too me it looks like __tcp_select_window(sk) doesn't at all take the
> freed-up memory into account when calculating a new window. I haven't
> looked into why that is happening.
>
>>
>>> This means that this side's notion of the current window size is
>>> different from the one last advertised to the peer, causing the latter
>>> to not send any data to resolve the sitution.
>>
>> Since the peer last saw a zero receive window at the time of the
>> memory-pressure drop, shouldn't the peer be sending repeated zero
>> window probes, and shouldn't the local host respond to a ZWP with an
>> ACK with the correct non-zero window?
>
> It should, but at the moment when I found this bug the peer stack was
> not the Linux kernel stack, but one we develop for our own purpose. We
> fixed that later, but it still means that traffic stops for a couple of
> seconds now and then before the timer restarts the flow. This happens
> too often for comfort in our usage scenarios.
> We can of course blame the the peer stack, but I still feel this is a
> bug, and that it could be handled better by the kernel stack.
>>
>> Do you happen to have a tcpdump .pcap of one of these cases that you
>> can share?
>
> I had one, although not for this particular run, and I cannot find it
> right now. I will continue looking or make a new one. Is there some
> shared space I can put it?
Here it is. Look at frame #1067.
https://passt.top/static/iperf3_jon_zero_window_cut.pcap
>
>>
>>> The problem occurs on the iperf3 server side, and the socket in question
>>> is a completely regular socket with the default settings for the
>>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>>>
>>> The following excerpt of a logging session, with own comments added,
>>> shows more in detail what is happening:
>>>
>>> // tcp_v4_rcv(->)
>>> // tcp_rcv_established(->)
>>> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/
>>> tcp_data_queue()/5257 ====
>>> [5201<->39222]: tcp_data_queue(->)
>>> [5201<->39222]: DROPPING skb [265600160..265665640], reason:
>>> SKB_DROP_REASON_PROTO_MEM
>>> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack
>>> 265469200, win_now 131184]
>>
>> What is "win_now"? That doesn't seem to correspond to any variable
>> name in the Linux source tree.
>
> See above.
>
> Can this be renamed to the
>> tcp_select_window() variable it is printing, like "cur_win" or
>> "effective_win" or "new_win", etc?
>>
>> Or perhaps you can attach your debugging patch in some email thread? I
>> agree with Eric that these debug dumps are a little hard to parse
>> without seeing the patch that allows us to understand what some of
>> these fields are...
>>
>> I agree with Eric that probably tp->pred_flags should be cleared, and
>> a packetdrill test for this would be super-helpful.
>
> I must admit I have never used packetdrill, but I can make an effort.
I hear from other sources that you cannot force a memory exhaustion with
packetdrill anyway, so this sounds like a pointless exercise.
///jon
>
> ///jon
>
>>
>> thanks,
>> neal
>>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [net,v2] tcp: correct handling of extreme memory squeeze
2025-01-18 20:04 ` Neal Cardwell
@ 2025-01-20 5:03 ` Jon Maloy
2025-01-20 16:10 ` Jon Maloy
0 siblings, 1 reply; 10+ messages in thread
From: Jon Maloy @ 2025-01-20 5:03 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
imagedong, eric.dumazet, edumazet
On 2025-01-18 15:04, Neal Cardwell wrote:
> On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
>>
>> From: Jon Maloy <jmaloy@redhat.com>
>>
>> Testing with iperf3 using the "pasta" protocol splicer has revealed
>> a bug in the way tcp handles window advertising in extreme memory
>> squeeze situations.
>>
>> Under memory pressure, a socket endpoint may temporarily advertise
>> a zero-sized window, but this is not stored as part of the socket data.
>> The reasoning behind this is that it is considered a temporary setting
>> which shouldn't influence any further calculations.
>>
>> However, if we happen to stall at an unfortunate value of the current
>> window size, the algorithm selecting a new value will consistently fail
>> to advertise a non-zero window once we have freed up enough memory.
>
> The "if we happen to stall at an unfortunate value of the current
> window size" phrase is a little vague... :-) Do you have a sense of
> what might count as "unfortunate" here? That might help in crafting a
> packetdrill test to reproduce this and have an automated regression
> test.
Obviously, it happens when the following code snippet in
__tcp_cleanup_rbuf() {
[....]
if (copied > 0 && !time_to_ack &&
!(sk->sk_shutdown & RCV_SHUTDOWN)) {
__u32 rcv_window_now = tcp_receive_window(tp);
/* Optimize, __tcp_select_window() is not cheap. */
if (2*rcv_window_now <= tp->window_clamp) {
__u32 new_window = __tcp_select_window(sk);
/* Send ACK now, if this read freed lots of space
* in our buffer. Certainly, new_window is new window.
* We can advertise it now, if it is not less than
* current one.
* "Lots" means "at least twice" here.
*/
if (new_window && new_window >= 2 * rcv_window_now)
time_to_ack = true;
}
}
[....]
}
yields time_to_ack = false, i.e. __tcp_select_window(sk) returns
a value new_window < (2 * tcp_receive_window(tp)).
In my log I have for brevity used the following names:
win_now: same as rcv_window_now
(= tcp_receive_window(tp),
= tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt,
= 265469200 + 262144 - 265600160,
= 131184)
new_win: same as new_window
(= __tcp_select_window(sk),
= 0 first time, later 262144 )
rcv_wnd: same as tp->rcv_wnd,
(=262144)
We see that although the last test actually is pretty close
(262144 >= 262368 ? => false) it is not close enough.
We also notice that
(tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960.
130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf():
(new_window >= 2 * rcv_window_now) will always be false.
Too me it looks like __tcp_select_window(sk) doesn't at all take the
freed-up memory into account when calculating a new window. I haven't
looked into why that is happening.
>
>> This means that this side's notion of the current window size is
>> different from the one last advertised to the peer, causing the latter
>> to not send any data to resolve the sitution.
>
> Since the peer last saw a zero receive window at the time of the
> memory-pressure drop, shouldn't the peer be sending repeated zero
> window probes, and shouldn't the local host respond to a ZWP with an
> ACK with the correct non-zero window?
It should, but at the moment when I found this bug the peer stack was
not the Linux kernel stack, but one we develop for our own purpose. We
fixed that later, but it still means that traffic stops for a couple of
seconds now and then before the timer restarts the flow. This happens
too often for comfort in our usage scenarios.
We can of course blame the the peer stack, but I still feel this is a
bug, and that it could be handled better by the kernel stack.
>
> Do you happen to have a tcpdump .pcap of one of these cases that you can share?
I had one, although not for this particular run, and I cannot find it
right now. I will continue looking or make a new one. Is there some
shared space I can put it?
>
>> The problem occurs on the iperf3 server side, and the socket in question
>> is a completely regular socket with the default settings for the
>> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>>
>> The following excerpt of a logging session, with own comments added,
>> shows more in detail what is happening:
>>
>> // tcp_v4_rcv(->)
>> // tcp_rcv_established(->)
>> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
>> [5201<->39222]: tcp_data_queue(->)
>> [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
>> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
>
> What is "win_now"? That doesn't seem to correspond to any variable
> name in the Linux source tree.
See above.
Can this be renamed to the
> tcp_select_window() variable it is printing, like "cur_win" or
> "effective_win" or "new_win", etc?
>
> Or perhaps you can attach your debugging patch in some email thread? I
> agree with Eric that these debug dumps are a little hard to parse
> without seeing the patch that allows us to understand what some of
> these fields are...
>
> I agree with Eric that probably tp->pred_flags should be cleared, and
> a packetdrill test for this would be super-helpful.
I must admit I have never used packetdrill, but I can make an effort.
///jon
>
> thanks,
> neal
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [net,v2] tcp: correct handling of extreme memory squeeze
2025-01-17 21:40 [net,v2] " jmaloy
2025-01-17 22:09 ` Eric Dumazet
2025-01-18 17:01 ` Jason Xing
@ 2025-01-18 20:04 ` Neal Cardwell
2025-01-20 5:03 ` Jon Maloy
2 siblings, 1 reply; 10+ messages in thread
From: Neal Cardwell @ 2025-01-18 20:04 UTC (permalink / raw)
To: jmaloy
Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
imagedong, eric.dumazet, edumazet
On Fri, Jan 17, 2025 at 4:41 PM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
The "if we happen to stall at an unfortunate value of the current
window size" phrase is a little vague... :-) Do you have a sense of
what might count as "unfortunate" here? That might help in crafting a
packetdrill test to reproduce this and have an automated regression
test.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.
Since the peer last saw a zero receive window at the time of the
memory-pressure drop, shouldn't the peer be sending repeated zero
window probes, and shouldn't the local host respond to a ZWP with an
ACK with the correct non-zero window?
Do you happen to have a tcpdump .pcap of one of these cases that you can share?
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> // tcp_v4_rcv(->)
> // tcp_rcv_established(->)
> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]: tcp_data_queue(->)
> [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
What is "win_now"? That doesn't seem to correspond to any variable
name in the Linux source tree. Can this be renamed to the
tcp_select_window() variable it is printing, like "cur_win" or
"effective_win" or "new_win", etc?
Or perhaps you can attach your debugging patch in some email thread? I
agree with Eric that these debug dumps are a little hard to parse
without seeing the patch that allows us to understand what some of
these fields are...
I agree with Eric that probably tp->pred_flags should be cleared, and
a packetdrill test for this would be super-helpful.
thanks,
neal
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [net,v2] tcp: correct handling of extreme memory squeeze
2025-01-17 21:40 [net,v2] " jmaloy
2025-01-17 22:09 ` Eric Dumazet
@ 2025-01-18 17:01 ` Jason Xing
2025-01-18 20:04 ` Neal Cardwell
2 siblings, 0 replies; 10+ messages in thread
From: Jason Xing @ 2025-01-18 17:01 UTC (permalink / raw)
To: jmaloy
Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
imagedong, eric.dumazet, edumazet
On Sat, Jan 18, 2025 at 5:41 AM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> // tcp_v4_rcv(->)
> // tcp_rcv_established(->)
> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]: tcp_data_queue(->)
> [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> [5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
> [5201<->39222]: __tcp_transmit_skb(->)
> [5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
> [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
>
> Furthermore, we have observed that in these situations this side may
> send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
> as it should be. Backing ack_seq seems to be harmless, but is of
> course still wrong from a protocol viewpoint.
>
> We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
> has been dropped because of memory exhaustion and we have to advertize
> a zero window.
>
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
Thanks for such a detailed explanation :) However, undeniably, I've
read several times until I totally follow the issue you reported. I
wonder if you can compact and simplify the commit message a little
bit, say, remove the last two calltraces that can be replaced by a
single sentence like you've already stated (which is "the queue is
empty").
Let me rephrase the scenario to see if I'm understanding in the right way:
1. First time the receiver sends a skb with zero window because of
shortage of memory.
2. Sometimes the user calls the tcp_recvmsg() to remove the data from
kernel to user space, sending the ACK process should be triggered,
especially when the queue is empty.
3. Then the client reading the zero window signal stops sending data.
#1 fails to update the rcv_wup and rcv_wnd, so since then in the
manner of __tcp_cleanup_rbuf(), the rcv_window_now variable is always
the same in this case (please see the calculation in
tcp_receive_window()). rcv_window_now, which I presume is the same as
win_now in the above description, should vary when
__tcp_cleanup_rbuf() is called. Because of that, #2 could never send
an ACK.
My small question is why is the new_window always 262144 no matter
what amount of data the user receives? Is it because in
__tcp_select_window() there is no wscale option?
Thanks,
Jason
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [net,v2] tcp: correct handling of extreme memory squeeze
2025-01-17 22:09 ` Eric Dumazet
@ 2025-01-17 22:27 ` Stefano Brivio
0 siblings, 0 replies; 10+ messages in thread
From: Stefano Brivio @ 2025-01-17 22:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: jmaloy, netdev, davem, kuba, passt-dev, lvivier, dgibson,
Menglong Dong, eric.dumazet
[Fixed Cc: for Menglong Dong, this is a reply to:
https://lore.kernel.org/all/CANn89i+Ks52JVTBsMFQBM4CqUR4cegXhbSCH77aMCqFpd-S_1A@mail.gmail.com/]
On Fri, 17 Jan 2025 23:09:27 +0100
Eric Dumazet <edumazet@google.com> wrote:
> On Fri, Jan 17, 2025 at 10:40 PM <jmaloy@redhat.com> wrote:
>
> > v1: -Posted on Apr 6, 2024
>
> Could you post the link, this was a long time ago and I forgot the context.
https://lore.kernel.org/all/20240406182107.261472-1-jmaloy@redhat.com/#r
--
Stefano
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [net,v2] tcp: correct handling of extreme memory squeeze
2025-01-17 21:40 [net,v2] " jmaloy
@ 2025-01-17 22:09 ` Eric Dumazet
2025-01-17 22:27 ` Stefano Brivio
2025-01-18 17:01 ` Jason Xing
2025-01-18 20:04 ` Neal Cardwell
2 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2025-01-17 22:09 UTC (permalink / raw)
To: jmaloy
Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson,
imagedong, eric.dumazet
On Fri, Jan 17, 2025 at 10:40 PM <jmaloy@redhat.com> wrote:
>
> From: Jon Maloy <jmaloy@redhat.com>
>
> Testing with iperf3 using the "pasta" protocol splicer has revealed
> a bug in the way tcp handles window advertising in extreme memory
> squeeze situations.
>
> Under memory pressure, a socket endpoint may temporarily advertise
> a zero-sized window, but this is not stored as part of the socket data.
> The reasoning behind this is that it is considered a temporary setting
> which shouldn't influence any further calculations.
>
> However, if we happen to stall at an unfortunate value of the current
> window size, the algorithm selecting a new value will consistently fail
> to advertise a non-zero window once we have freed up enough memory.
> This means that this side's notion of the current window size is
> different from the one last advertised to the peer, causing the latter
> to not send any data to resolve the sitution.
>
> The problem occurs on the iperf3 server side, and the socket in question
> is a completely regular socket with the default settings for the
> fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
>
> The following excerpt of a logging session, with own comments added,
> shows more in detail what is happening:
>
> // tcp_v4_rcv(->)
> // tcp_rcv_established(->)
> [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
> [5201<->39222]: tcp_data_queue(->)
> [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
> [5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
> [5201<->39222]: __tcp_transmit_skb(->)
> [5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
> [5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
> [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
> [5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_rcv_established(<-)
> [5201<->39222]: tcp_v4_rcv(<-)
>
> // Receive queue is at 85 buffers and we are out of memory.
> // We drop the incoming buffer, although it is in sequence, and decide
> // to send an advertisement with a window of zero.
> // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
> // we unconditionally shrink the window.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
>
> // After each read, the algorithm for calculating the new receive
> // window in __tcp_cleanup_rbuf() finds it is too small to advertise
> // or to update tp->rcv_wnd.
> // Meanwhile, the peer thinks the window is zero, and will not send
> // any more data to trigger an update from the interrupt mode side.
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
>
> // The above pattern repeats again and again, since nothing changes
> // between the reads.
>
> [...]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]
>
> [5201<->39222]: tcp_recvmsg_locked(->)
> [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
> [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
> [5201<->39222]: NOT calling tcp_send_ack()
> [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
> [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
> [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
> [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
>
> // The receive queue is empty, but no new advertisement has been sent.
> // The peer still thinks the receive window is zero, and sends nothing.
> // We have ended up in a deadlock situation.
>
> Furthermore, we have observed that in these situations this side may
> send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
> as it should be. Backing ack_seq seems to be harmless, but is of
> course still wrong from a protocol viewpoint.
>
> We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
> has been dropped because of memory exhaustion and we have to advertize
> a zero window.
>
> Further testing shows that the connection recovers neatly from the
> squeeze situation, and traffic can continue indefinitely.
>
> Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
> v1: -Posted on Apr 6, 2024
Could you post the link, this was a long time ago and I forgot the context.
> v2: -Improved commit log to clarify how we end up in this situation.
> -After feedback from Eric Dumazet, removed references to use of
> SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
> about how this situation occurs. Those flags are used at the
> peer side's incoming connection, and not on this one.
> ---
> net/ipv4/tcp_output.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 0e5b9a654254..ba295f798e5e 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
> u32 cur_win, new_win;
>
> /* Make the window 0 if we failed to queue the data because we
> - * are out of memory. The window is temporary, so we don't store
> - * it on the socket.
> + * are out of memory.
> */
> - if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
> + if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
> + tp->rcv_wnd = 0;
> + tp->rcv_wup = tp->rcv_nxt;
I wonder if we should not clear tp->pred_flags here ?
Also, any chance you could provide a packetdrill test ?
Your changelog contains traces that are hard to follow.
Thanks.
^ permalink raw reply [flat|nested] 10+ messages in thread
* [net,v2] tcp: correct handling of extreme memory squeeze
@ 2025-01-17 21:40 jmaloy
2025-01-17 22:09 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: jmaloy @ 2025-01-17 21:40 UTC (permalink / raw)
To: netdev, davem
Cc: kuba, passt-dev, jmaloy, sbrivio, lvivier, dgibson, imagedong,
eric.dumazet, edumazet
From: Jon Maloy <jmaloy@redhat.com>
Testing with iperf3 using the "pasta" protocol splicer has revealed
a bug in the way tcp handles window advertising in extreme memory
squeeze situations.
Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.
However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.
The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.
The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:
// tcp_v4_rcv(->)
// tcp_rcv_established(->)
[5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]: tcp_data_queue(->)
[5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
[5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
[5201<->39222]: __tcp_transmit_skb(->)
[5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
[5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)
// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
// The above pattern repeats again and again, since nothing changes
// between the reads.
[...]
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
[rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
[copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.
Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.
We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet
has been dropped because of memory exhaustion and we have to advertize
a zero window.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.
Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v1: -Posted on Apr 6, 2024
v2: -Improved commit log to clarify how we end up in this situation.
-After feedback from Eric Dumazet, removed references to use of
SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding
about how this situation occurs. Those flags are used at the
peer side's incoming connection, and not on this one.
---
net/ipv4/tcp_output.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..ba295f798e5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
u32 cur_win, new_win;
/* Make the window 0 if we failed to queue the data because we
- * are out of memory. The window is temporary, so we don't store
- * it on the socket.
+ * are out of memory.
*/
- if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+ if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+ tp->rcv_wnd = 0;
+ tp->rcv_wup = tp->rcv_nxt;
return 0;
+ }
cur_win = tcp_receive_window(tp);
new_win = __tcp_select_window(sk);
--
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
u32 cur_win, new_win;
/* Make the window 0 if we failed to queue the data because we
- * are out of memory. The window is temporary, so we don't store
- * it on the socket.
+ * are out of memory.
*/
- if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+ if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+ tp->rcv_wnd = 0;
+ tp->rcv_wup = tp->rcv_nxt;
return 0;
+ }
cur_win = tcp_receive_window(tp);
new_win = __tcp_select_window(sk);
--
2.48.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-01-20 16:23 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-16 2:29 [net, v2] tcp: correct handling of extreme memory squeeze Jon Maloy
2025-01-16 21:14 ` Stefano Brivio
2025-01-17 21:40 [net,v2] " jmaloy
2025-01-17 22:09 ` Eric Dumazet
2025-01-17 22:27 ` Stefano Brivio
2025-01-18 17:01 ` Jason Xing
2025-01-18 20:04 ` Neal Cardwell
2025-01-20 5:03 ` Jon Maloy
2025-01-20 16:10 ` Jon Maloy
2025-01-20 16:22 ` Eric Dumazet
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).