public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* Re: [PATCH v5 3/3] tcp: allow retransmit when peer receive window is zero
       [not found]     ` <bd793a8a-02aa-47f0-17b7-d1ab188902b4@redhat.com>
@ 2024-05-17 12:15       ` Stefano Brivio
  2024-05-17 14:13         ` Jon Maloy
  0 siblings, 1 reply; 2+ messages in thread
From: Stefano Brivio @ 2024-05-17 12:15 UTC (permalink / raw)
  To: Jon Maloy; +Cc: passt-dev, lvivier, dgibson

On Thu, 16 May 2024 22:39:53 -0400
Jon Maloy <jmaloy@redhat.com> wrote:

> On 2024-05-16 18:21, Stefano Brivio wrote:
> > On Thu, 16 May 2024 17:33:28 -0400
> > Jon Maloy <jmaloy@redhat.com> wrote:
> >  
> >> A bug in kernel TCP may lead to a deadlock where a zero window is sent
> >> from the peer, while it is unable to send out window updates even after
> >> reads have freed up enough buffer space to permit a larger window.
> >> In this situation, new window advertisemnts from the peer can only be
> >> triggered by packets arriving from this side.
> >>
> >> However, such packets are never sent, because the zero-window condition
> >> currently prevents this side from sending out any packets whatsoever
> >> to the peer.
> >>
> >> We notice that the above bug is triggered *only* after the peer has
> >> dropped an arriving packet because of severe memory squeeze, and that we
> >> hence always enter a retransmission situation when this occurs. This
> >> also means that it goes against the RFC 9293 recommendation that a
> >> previously advertised window never should shrink.
> >>
> >> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
> >> the following statement:
> >> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right
> >> window edge to the left (SHLD-14). However, a sending TCP peer MUST
> >> be robust against window shrinking, which may cause the
> >> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
> >>
> >> If this happens, the sender SHOULD NOT send new data (SHLD-15), but
> >> SHOULD retransmit normally the old unacknowledged data between SND.UNA
> >> and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
> >> beyond SND.UNA+SND.WND (MAY-7)"
> >>
> >> We never see the window become negative, but we interpret this as a
> >> recommendation to use the previously available window during
> >> retransmission even when the currently advertised window is zero.
> >>
> >> We use the above mechanism only for timer-induced retransmits, while
> >> the fast-retransmit mechanism won't trigger on this condition.
> >>
> >> It should be noted that although this solves the problem we have at
> >> hand, it is not a genuine solution to the kernel bug. There may well
> >> be TCP stacks around in other OS-es which don't do this, nor have
> >> keep-alive probing as an alternatve way to solve the situation.
> >>
> >> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> >>
> >> ---
> >> v2: - Using previously advertised window during retransmission, instead
> >>        highest send sequencece number in the cycle.
> >> v3: - Rebased to newest code
> >>      - Changes based on feedback from PASST team
> >>      - Sending out empty probe message at timer expiration when
> >>        we are not in retransmit situation.
> >> v4: - Some small changes based on feedback from PASST team.
> >>      - Replaced fast retransmit with a one-time 'fast probe' when
> >>        window is zero.
> >> v5: - Gave up on 'fast probing' for now. When I got the sequence
> >>        numbers right in the flag message (after having emptied the tap
> >>        queue), it turns out an empty message does *not* force a new peer
> >>        window update as was my previous understanding of the code.
> >>      - Added cppcheck suppression line (which I was unable to verify)  
> > ...hmm, why? All it takes is a 'make cppcheck'. Or is your version of
> > cppcheck not showing it?  
> Exactly. My version is too recent, I think.
> >  
> >>        as suggested by S. Brivio.
> >>      - Removed sending of empty probe when window from tap side is zero.
> >>        It looks pointless at the moment, at least for solving the above
> >>        described situation.  
> > So, as far as I understand, this patch now doesn't introduce any
> > functional change, except that if an existing timer happened to be
> > pending (this patch never schedules it),
> >   and the window _we_ are
> > advertising (wnd_to_tap) is zero (I'm not sure why that's relevant),
> > then we'll send an ACK segment.  
> The change is that we really will do a retransmit from the timer handler,
> ignoring that current window is zero.

...but how does the window *we* advertise matter here? You're checking
conn->wnd_to_tap now, not conn->wnd_from_tap. Anyway, if you're now
abandoning this approach, it doesn't matter.

> This solves the deadlock problem
> I have been describing, but it is not a good solution.
> 
> One problem is that new incoming data will cause an EPOLLIN event,
> which will cause a call to tcp_data_from_sock(), which will also ignore
> the zero-window.

Right now it doesn't ignore it. Do you plan to use that function to
check if the usable window is now negative (because the window value is
zero but we have unacknowledged data), and re-send something in that
case?

> This is easy to fix, I realize:
> 
>   if (events & EPOLLIN && conn->wnd_from_tap)
>                 tcp_data_from_sock(c, conn);
> 
> The next is that I would like to start a short timer when I receive
> the zero-window and seq_to_tap > seq_ack_from_tap (same condition
> as in previous versions, but without sending the flag packet).
> As far as I can see there is no simple way to schedule the timer to
> the value I want. I am open to suggestions here.

You could reuse/abuse ACK_TO_TAP_DUE (after all, you want to send a
keep-alive in a while), and schedule a different timeout in
tcp_timer_ctl() if the peer window is zero.

Then, in the timer handler, depending on the peer window, you would
check if you want to use ACK_IF_NEEDED (no keepalive), or ACK (force a
keepalive in any case).

-- 
Stefano


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [PATCH v5 3/3] tcp: allow retransmit when peer receive window is zero
  2024-05-17 12:15       ` [PATCH v5 3/3] tcp: allow retransmit when peer receive window is zero Stefano Brivio
@ 2024-05-17 14:13         ` Jon Maloy
  0 siblings, 0 replies; 2+ messages in thread
From: Jon Maloy @ 2024-05-17 14:13 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, lvivier, dgibson



On 2024-05-17 08:15, Stefano Brivio wrote:
> On Thu, 16 May 2024 22:39:53 -0400
> Jon Maloy <jmaloy@redhat.com> wrote:
>
>> On 2024-05-16 18:21, Stefano Brivio wrote:
>>> On Thu, 16 May 2024 17:33:28 -0400
>>> Jon Maloy <jmaloy@redhat.com> wrote:
>>>   
>>>> A bug in kernel TCP may lead to a deadlock where a zero window is sent
>>>> from the peer, while it is unable to send out window updates even after
>>>> reads have freed up enough buffer space to permit a larger window.
>>>> In this situation, new window advertisemnts from the peer can only be
>>>> triggered by packets arriving from this side.
>>>>
>>>> However, such packets are never sent, because the zero-window condition
>>>> currently prevents this side from sending out any packets whatsoever
>>>> to the peer.
>>>>
>>>> We notice that the above bug is triggered *only* after the peer has
>>>> dropped an arriving packet because of severe memory squeeze, and that we
>>>> hence always enter a retransmission situation when this occurs. This
>>>> also means that it goes against the RFC 9293 recommendation that a
>>>> previously advertised window never should shrink.
>>>>
>>>> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
>>>> the following statement:
>>>> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right
>>>> window edge to the left (SHLD-14). However, a sending TCP peer MUST
>>>> be robust against window shrinking, which may cause the
>>>> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
>>>>
>>>> If this happens, the sender SHOULD NOT send new data (SHLD-15), but
>>>> SHOULD retransmit normally the old unacknowledged data between SND.UNA
>>>> and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
>>>> beyond SND.UNA+SND.WND (MAY-7)"
>>>>
>>>> We never see the window become negative, but we interpret this as a
>>>> recommendation to use the previously available window during
>>>> retransmission even when the currently advertised window is zero.
>>>>
>>>> We use the above mechanism only for timer-induced retransmits, while
>>>> the fast-retransmit mechanism won't trigger on this condition.
>>>>
>>>> It should be noted that although this solves the problem we have at
>>>> hand, it is not a genuine solution to the kernel bug. There may well
>>>> be TCP stacks around in other OS-es which don't do this, nor have
>>>> keep-alive probing as an alternatve way to solve the situation.
>>>>
>>>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>>>>
>>>> ---
>>>> v2: - Using previously advertised window during retransmission, instead
>>>>         highest send sequencece number in the cycle.
>>>> v3: - Rebased to newest code
>>>>       - Changes based on feedback from PASST team
>>>>       - Sending out empty probe message at timer expiration when
>>>>         we are not in retransmit situation.
>>>> v4: - Some small changes based on feedback from PASST team.
>>>>       - Replaced fast retransmit with a one-time 'fast probe' when
>>>>         window is zero.
>>>> v5: - Gave up on 'fast probing' for now. When I got the sequence
>>>>         numbers right in the flag message (after having emptied the tap
>>>>         queue), it turns out an empty message does *not* force a new peer
>>>>         window update as was my previous understanding of the code.
>>>>       - Added cppcheck suppression line (which I was unable to verify)
>>> ...hmm, why? All it takes is a 'make cppcheck'. Or is your version of
>>> cppcheck not showing it?
>> Exactly. My version is too recent, I think.
>>>   
>>>>         as suggested by S. Brivio.
>>>>       - Removed sending of empty probe when window from tap side is zero.
>>>>         It looks pointless at the moment, at least for solving the above
>>>>         described situation.
>>> So, as far as I understand, this patch now doesn't introduce any
>>> functional change, except that if an existing timer happened to be
>>> pending (this patch never schedules it),
>>>    and the window _we_ are
>>> advertising (wnd_to_tap) is zero (I'm not sure why that's relevant),
>>> then we'll send an ACK segment.
>> The change is that we really will do a retransmit from the timer handler,
>> ignoring that current window is zero.
> ...but how does the window *we* advertise matter here? You're checking
> conn->wnd_to_tap now, not conn->wnd_from_tap. Anyway, if you're now
> abandoning this approach, it doesn't matter.
It doesn't matter at all. I stated in the the comment that I had removed 
this one, but forgot to do  it. :-)

>
>> This solves the deadlock problem
>> I have been describing, but it is not a good solution.
>>
>> One problem is that new incoming data will cause an EPOLLIN event,
>> which will cause a call to tcp_data_from_sock(), which will also ignore
>> the zero-window.
> Right now it doesn't ignore it. Do you plan to use that function to
> check if the usable window is now negative (because the window value is
> zero but we have unacknowledged data), and re-send something in that
> case?
Yes, it is ignoring it, because max_send is based on seq_wnd_edge_from_tap,
which wasn't moved when we received the zero-window adverisement.
>
>> This is easy to fix, I realize:
>>
>>    if (events & EPOLLIN && conn->wnd_from_tap)
>>                  tcp_data_from_sock(c, conn);
This is my plan.
>> The next is that I would like to start a short timer when I receive
>> the zero-window and seq_to_tap > seq_ack_from_tap (same condition
>> as in previous versions, but without sending the flag packet).
>> As far as I can see there is no simple way to schedule the timer to
>> the value I want. I am open to suggestions here.
> You could reuse/abuse ACK_TO_TAP_DUE (after all, you want to send a
> keep-alive in a while), and schedule a different timeout in
> tcp_timer_ctl() if the peer window is zero.
>
> Then, in the timer handler, depending on the peer window, you would
> check if you want to use ACK_IF_NEEDED (no keepalive), or ACK (force a
> keepalive in any case).
>
As it is now, ACK_FROM_TAP_DUE with associated timer is always set when
we receive the zero-window, so it kind of works. The downside is that I
see a glitch of 1-2 seconds now and then until traffic recovers.
Trying with 10 ms as you suggest is much better. I'll give it a try.

///jon



^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2024-05-17 14:13 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20240516213328.1049176-1-jmaloy@redhat.com>
     [not found] ` <20240516213328.1049176-4-jmaloy@redhat.com>
     [not found]   ` <20240517002110.1c957b6d@elisabeth>
     [not found]     ` <bd793a8a-02aa-47f0-17b7-d1ab188902b4@redhat.com>
2024-05-17 12:15       ` [PATCH v5 3/3] tcp: allow retransmit when peer receive window is zero Stefano Brivio
2024-05-17 14:13         ` Jon Maloy

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).