public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Jon Maloy <jmaloy@redhat.com>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero
Date: Wed, 15 May 2024 19:10:49 -0400	[thread overview]
Message-ID: <c7d897ac-9589-f96a-ff51-1eb3af3b7646@redhat.com> (raw)
In-Reply-To: <20240515222417.72ce256b@elisabeth>



On 2024-05-15 16:24, Stefano Brivio wrote:
> On Wed, 15 May 2024 11:34:29 -0400
> Jon Maloy <jmaloy@redhat.com> wrote:
>
>> A bug in kernel TCP may lead to a deadlock where a zero window is sent
>> from the peer, while it is unable to send out window updates even after
>> reads have freed up enough buffer space to permit a larger window.
>> In this situation, new window advertisemnts from the peer can only be
>> triggered by packets arriving from this side.
>>
>> However, such packets are never sent, because the zero-window condition
>> currently prevents this side from sending out any packets whatsoever
>> to the peer.
>>
>> We notice that the above bug is triggered *only* after the peer has
>> dropped an arriving packet because of severe memory squeeze, and that we
>> hence always enter a retransmission situation when this occurs. This
>> also means that it goes against the RFC 9293 recommendation that a
>> previously advertised window never should shrink.
>>
>> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
>> the following statement:
>> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right
>> window edge to the left (SHLD-14). However, a sending TCP peer MUST
>> be robust against window shrinking, which may cause the
>> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
>>
>> If this happens, the sender SHOULD NOT send new data (SHLD-15), but
>> SHOULD retransmit normally the old unacknowledged data between SND.UNA
>> and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
>> beyond SND.UNA+SND.WND (MAY-7)"
>>
>> We never see the window become negative, but we interpret this as a
>> recommendation to use the previously available window during
>> retransmission even when the currently advertised window is zero.
>>
>> We use the above mechanism only at timer-induced retransmits.
>> In the case we receive duplicate ack and a zero window, but
>> still know we have outstanding data acks waiting, we send out an
>> empty "fast probe" instead of doing fast retransmit. This averts
>> the risk of overwhelming a memory squeezed peer with retransmits,
>> while still forcing it to send out a new window update when the
>> probe is received. This entails a theoretical risk of redundant
>> retransmits from the peer, but that is a risk worth taking.
>>
>> In case of a zero-window non-retransmission situation where there
>> is no new data to be sent, we also add a simple zero-window probing
>> feature. By sending an empty packet at regular timeout events we
>> resolve the situation described above, since the peer receives the
>> necessary trigger to advertise its window once it becomes non-zero
>> again.
>>
>> It should be noted that although this solves the problem we have at
>> hand, it is not a genuine solution to the kernel bug. There may well
>> be TCP stacks around in other OS-es which don't do this, nor have
>> keep-alive probing as an alternatve way to solve the situation.
>>
>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>>
>> ---
>> v2: - Using previously advertised window during retransmission, instead
>>        highest send sequencece number in the cycle.
>> v3: - Rebased to newest code
>>      - Changes based on feedback from PASST team
>>      - Sending out empty probe message at timer expiration when
>>        we are not in retransmit situation.
>> v4: - Some small changes based on feedback from PASST team.
>>      - Replaced fast retransmit with a one-time 'fast probe' when
>>        window is zero.
>> ---
>>   tcp.c      | 32 +++++++++++++++++++++++++++-----
>>   tcp_conn.h |  2 ++
>>   2 files changed, 29 insertions(+), 5 deletions(-)
>>
>> diff --git a/tcp.c b/tcp.c
>> index 4163bf9..a33f494 100644
>> --- a/tcp.c
>> +++ b/tcp.c
>> @@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
>>    */
>>   static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
>>   {
>> +	uint32_t wnd_edge;
>> +
>>   	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
>>   	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
>>   
>> +	wnd_edge = conn->seq_ack_from_tap + wnd;
>> +	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
> Here, cppcheck ('make cppcheck') says:
>
> tcp.c:1770:6: style: Condition 'wnd' is always true [knownConditionTrueFalse]
>   if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
>       ^
> tcp.c:1766:8: note: Assignment 'wnd=((1<<(16+8))<(wnd<<conn->ws_from_tap))?(1<<(16+8)):(wnd<<conn->ws_from_tap)', assigned value is less than 1
>   wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
>         ^
> tcp.c:1770:6: note: Condition 'wnd' is always true
>   if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
>       ^
>
> See the comment in tcp_update_seqack_wnd() and related suppression.
>
> It's clearly a false positive (if you omit the MIN() macro, it goes
> away), so we need that same suppression here.
Ok. I'll change it. Still a little annoying when our tools are causing 
us extra job because they aren't up to the task.
>
>> +		conn->seq_wnd_edge_from_tap = wnd_edge;
>> +
>>   	/* FIXME: reflect the tap-side receiver's window back to the sock-side
>>   	 * sender by adjusting SO_RCVBUF? */
>>   }
>> @@ -1796,6 +1802,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
>>   	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
>>   
>>   	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
>> +	conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
>>   }
>>   
>>   /**
>> @@ -2205,13 +2212,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>>    */
>>   static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>>   {
>> -	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>>   	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
>>   	int sendlen, len, dlen, v4 = CONN_V4(conn);
>> +	uint32_t already_sent, max_send, seq;
>>   	int s = conn->sock, i, ret = 0;
>>   	struct msghdr mh_sock = { 0 };
>>   	uint16_t mss = MSS_GET(conn);
>> -	uint32_t already_sent, seq;
>>   	struct iovec *iov;
>>   
>>   	/* How much have we read/sent since last received ack ? */
>> @@ -2225,19 +2231,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>>   		tcp_set_peek_offset(s, 0);
>>   	}
>>   
>> -	if (!wnd_scaled || already_sent >= wnd_scaled) {
>> +	/* How much are we still allowed to send within current window ? */
>> +	max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap;
>> +	if (SEQ_LE(max_send, 0)) {
>> +		flow_trace(conn, "Empty window: win_upper: %u, sent: %u",
> This is not win_upper anymore, and the window is actually full rather
> than empty (it's... empty of space). Maybe:
>
> 		flow_trace(conn, "Window full: right edge: %u, sent: %u"
yes.
>
>> +			   conn->seq_wnd_edge_from_tap, conn->seq_to_tap);
>> +		conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
>>   		conn_flag(c, conn, STALLED);
>>   		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>>   		return 0;
>>   	}
>>   
>>   	/* Set up buffer descriptors we'll fill completely and partially. */
>> -	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
>> +	fill_bufs = DIV_ROUND_UP(max_send,  mss);
>>   	if (fill_bufs > TCP_FRAMES) {
>>   		fill_bufs = TCP_FRAMES;
>>   		iov_rem = 0;
>>   	} else {
>> -		iov_rem = (wnd_scaled - already_sent) % mss;
>> +		iov_rem = max_send % mss;
>>   	}
>>   
>>   	/* Prepare iov according to kernel capability */
>> @@ -2466,6 +2477,13 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>>   		conn->seq_to_tap = max_ack_seq;
>>   		tcp_set_peek_offset(conn->sock, 0);
>>   		tcp_data_from_sock(c, conn);
>> +	} else if (!max_ack_seq_wnd && SEQ_GT(conn->seq_to_tap, max_ack_seq)) {
>> +		/* Force peer to send new advertisement now, but only once */
> Two questions:
>
> - which advertisement? We're sending a zero-window probe, not forcing
>    the peer to do much really. I would rather just state that we're
>    sending a probe
Actually, it is only clear if you know the code of the (linux) peer.
I realized this was maybe a too strong statement, but this is really
what happens.

>
> - what guarantees it only happens once? If we get more data from the
>    socket, we'll get again SEQ_GT(conn->seq_to_tap, max_ack_seq) in a
>    bit, and send another ACK (duplicate) to the peer, without the peer
>    necessarily ever advertising a non-zero window meanwhile
Right. I need to give this  a new spin, but I think I am on the right 
track.
I *really* want this to be solved here, and not need to fall back to
  the timeout except under exceptional conditions.
This happens really often in my tests, and with this fix it actually
works like a charm.

>
>    I'm struggling a bit to understand how this can work "cleanly", a
>    packet capture of this mechanism in action would certainly help.
Ok. In v5.
>
>> +		flow_trace(conn, "fast probe, ACK: %u, previous sequence: %u",
>> +			   max_ack_seq, conn->seq_to_tap);
>> +		tcp_send_flag(c, conn, ACK);
>> +		conn->seq_to_tap = max_ack_seq;
>> +		tcp_set_peek_offset(conn->sock, 0);
>>   	}
>>   
>>   	if (!iov_i)
>> @@ -2911,6 +2929,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>>   			flow_dbg(conn, "activity timeout");
>>   			tcp_rst(c, conn);
>>   		}
>> +		/* No data exchanged recently? Keep connection alive. */
> ...I just spotted this from v3: this is not the reason why we're
> sending a keep-alive. We're sending a keep-alive segment because the
> peer advertised its window as zero.
>
> I also realised that this is not scheduled additionally, so it will
> just trigger on an activity timeout, I suppose. We should reschedule
> this after ACK_TIMEOUT, instead (that was my earlier suggestion, I
> didn't check anymore) when the peer advertises a zero window.
>
>> +		if (conn->seq_to_tap == conn->seq_ack_from_tap &&
> ...this part will only work if we reset seq_to_tap to seq_ack_from_tap
> earlier, and we have no pending data to send, which is not necessarily
> the case if we want to send a zero-window probe.
>
>> +		    conn->seq_from_tap == conn->seq_ack_to_tap)
>> +			tcp_send_flag(c, conn, ACK);
> I think the conditions should simply be:
>
> - the window currently advertised by the peer is zero
>
> - we don't have pending data to acknowledge (otherwise the peer can
>    interpret our keep-alive as a duplicate ACK)
That is ok.
The deadlock situation we have ended up in will anyway be resolved by the
real retransmit that will happen before we get here.
I am now wondering if this probe it serves any purpose at all?
>
>>   	}
>>   }
>>   
>> diff --git a/tcp_conn.h b/tcp_conn.h
>> index d280b22..5cbad2a 100644
>> --- a/tcp_conn.h
>> +++ b/tcp_conn.h
>> @@ -30,6 +30,7 @@
>>    * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
>>    * @seq_to_tap:		Next sequence for packets to tap
>>    * @seq_ack_from_tap:	Last ACK number received from tap
>> + * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap
>>    * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
>>    * @seq_ack_to_tap:	Last ACK number sent to tap
>>    * @seq_init_from_tap:	Initial sequence number from tap
>> @@ -101,6 +102,7 @@ struct tcp_tap_conn {
>>   
>>   	uint32_t	seq_to_tap;
>>   	uint32_t	seq_ack_from_tap;
>> +	uint32_t	seq_wnd_edge_from_tap;
>>   	uint32_t	seq_from_tap;
>>   	uint32_t	seq_ack_to_tap;
>>   	uint32_t	seq_init_from_tap;

I'll re-spin the two first ones asap, so you can apply them, and then I 
will try to improve
this one further.

///jon


  reply	other threads:[~2024-05-15 23:10 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-15 15:34 [PATCH v4 0/3] Support for SO_PEEK_OFF socket option Jon Maloy
2024-05-15 15:34 ` [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued Jon Maloy
2024-05-15 20:20   ` Stefano Brivio
2024-05-16  2:24   ` David Gibson
2024-05-16  2:57     ` Jon Maloy
2024-05-16  4:16       ` David Gibson
2024-06-04 17:36         ` Jon Maloy
2024-06-04 18:04           ` Jon Maloy
2024-06-04 18:10             ` Stefano Brivio
2024-05-15 15:34 ` [PATCH v4 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-05-15 20:22   ` Stefano Brivio
2024-05-16  2:29   ` David Gibson
2024-05-16  3:03     ` Jon Maloy
2024-05-15 15:34 ` [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero Jon Maloy
2024-05-15 20:24   ` Stefano Brivio
2024-05-15 23:10     ` Jon Maloy [this message]
2024-05-16  7:19       ` David Gibson
2024-05-16 11:22       ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c7d897ac-9589-f96a-ff51-1eb3af3b7646@redhat.com \
    --to=jmaloy@redhat.com \
    --cc=dgibson@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).