public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Jon Maloy <jmaloy@redhat.com>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH v3 3/3] tcp: allow retransmit when peer receive window is zero
Date: Tue, 14 May 2024 16:19:16 -0400	[thread overview]
Message-ID: <b20434e0-135f-4a67-2601-11ef0933bf02@redhat.com> (raw)
In-Reply-To: <20240514194650.53b433f9@elisabeth>



On 2024-05-14 13:46, Stefano Brivio wrote:
> On Sat, 11 May 2024 11:20:08 -0400
> Jon Maloy <jmaloy@redhat.com> wrote:
>
>> A bug in kernel TCP may lead to a deadlock where a zero window is sent
>> from the peer, while it is unable to send out window updates even after
>> reads have freed up enough buffer space to permit a larger window.
>> In this situation, new window advertisemnts from the peer can only be
>> triggered by packets arriving from this side.
>>
>> However, such packets are never sent, because the zero-window condition
>> currently prevents this side from sending out any packets whatsoever
>> to the peer.
>>
>> We notice that the above bug is triggered *only* after the peer has
>> dropped an arriving packet because of severe memory squeeze, and that we
>> hence always enter a retransmission situation when this occurs. This
>> also means that it goes against the RFC 9293 recommendation that a
>> previously advertised window never should shrink.
>>
>> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
>> the following statement:
>> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right
>> window edge to the left (SHLD-14). However, a sending TCP peer MUST
>> be robust against window shrinking, which may cause the
>> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
>>
>> If this happens, the sender SHOULD NOT send new data (SHLD-15), but
>> SHOULD retransmit normally the old unacknowledged data between SND.UNA
>> and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
>> beyond SND.UNA+SND.WND (MAY-7)"
>>
>> We never see the window become negative, but we interpret this as a
>> recommendation to use the previously available window during
>> retransmission even when the currently advertised window is zero.
>>
>> In case of a zero-window non-retransmission situation where there
>> is no new data to be sent, we also add a simple zero-window probing
>> feature. By sending an empty packet at regular timeout events we
>> resolve the situation described above, since the peer receives the
>> necessary trigger to advertise its window once it becomes non-zero
>> again.
>>
>> It should be noted that although this solves the problem we have at
>> hand, it is not a genuine solution to the kernel bug. There may well
>> be TCP stacks around in other OS-es which don't do this, nor have
>> keep-alive probing as an alternatve way to solve the situation.
>>
>> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>>
>> ---
>> v2: - Using previously advertised window during retransmission, instead
>>        highest send sequencece number in the cycle.
>> v3: - Rebased to newest code
>>      - Changes based on feedback from PASST team
>>      - Sending out empty probe message at timer expiration when
>>        we are not in retransmit situation.
>> ---
>>   tcp.c      | 30 +++++++++++++++++++++---------
>>   tcp_conn.h |  2 ++
>>   2 files changed, 23 insertions(+), 9 deletions(-)
>>
>> diff --git a/tcp.c b/tcp.c
>> index 8297812..bd6bf35 100644
>> --- a/tcp.c
>> +++ b/tcp.c
>> @@ -1774,9 +1774,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
>>    */
>>   static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
>>   {
>> +	uint32_t wnd_upper;
>> +
>>   	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
>>   	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
>>   
>> +	wnd_upper = conn->seq_ack_from_tap + wnd;
>> +	if (wnd && SEQ_GT(wnd_upper, conn->seq_wup_from_tap))
>> +		conn->seq_wup_from_tap = wnd_upper;
>> +
>>   	/* FIXME: reflect the tap-side receiver's window back to the sock-side
>>   	 * sender by adjusting SO_RCVBUF? */
>>   }
>> @@ -1809,6 +1815,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
>>   	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
>>   
>>   	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
>> +	conn->seq_wup_from_tap = conn->seq_to_tap;
>>   }
>>   
>>   /**
>> @@ -2220,7 +2227,6 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>>    */
>>   static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>>   {
>> -	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>>   	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
>>   	int sendlen, len, dlen, v4 = CONN_V4(conn);
>>   	uint32_t max_send, seq, already_sent;
>> @@ -2241,10 +2247,11 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>>   	}
>>   
>>   	/* How much are we still allowed to send within current window ? */
>> -	max_send = conn->seq_ack_from_tap + wnd_scaled - conn->seq_to_tap;
>> +	max_send = conn->seq_wup_from_tap - conn->seq_to_tap;
>>   	if (SEQ_LE(max_send, 0)) {
>> -		flow_trace(conn, "Empty window: win: %u, sent: %u",
>> -			   wnd_scaled, conn->seq_to_tap);
>> +		flow_trace(conn, "Empty window: win_upper: %u, sent: %u",
>> +			   conn->seq_wup_from_tap, conn->seq_to_tap);
>> +		conn->seq_wup_from_tap = conn->seq_to_tap;
>>   		conn_flag(c, conn, STALLED);
>>   		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>>   		return 0;
>> @@ -2380,7 +2387,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>>   	ASSERT(conn->events & ESTABLISHED);
>>   
>>   	for (i = idx, iov_i = 0; i < (int)p->count; i++) {
>> -		uint32_t seq, seq_offset, ack_seq;
>> +		uint32_t seq, seq_offset, ack_seq, wnd;
>>   		const struct tcphdr *th;
>>   		char *data;
>>   		size_t off;
>> @@ -2413,11 +2420,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>>   			if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) &&
>>   			    SEQ_GE(ack_seq, max_ack_seq)) {
>>   				/* Fast re-transmit */
>> +				wnd = ntohs(th->window);
>>   				retr = !len && !th->fin &&
>>   				       ack_seq == max_ack_seq &&
>> -				       ntohs(th->window) == max_ack_seq_wnd;
>> +				       (wnd == max_ack_seq_wnd || !wnd);
> Just as a reminder, as I mentioned on Monday: this means we'll
> re-transmit whenever we get a pure window update (!len && !th->fin
> && ack_seq == max_ack_seq) with a zero window. The receiver is telling
> us it ran out of space, and wham, we flood them, as a punishment.
>
> I would let this check alone, and just add zero-window probing, plus
> whatever retransmission you mentioned from the RFC -- but not a fast
> re-transmit on a zero window.
I think I have a good idea here. I'll use it in my next version.

>
>>   
>> -				max_ack_seq_wnd = ntohs(th->window);
>> +				max_ack_seq_wnd = wnd;
>>   				max_ack_seq = ack_seq;
>>   			}
>>   		}
>> @@ -2480,8 +2488,9 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>>   
>>   	if (retr) {
>>   		flow_trace(conn,
>> -			   "fast re-transmit, ACK: %u, previous sequence: %u",
>> -			   max_ack_seq, conn->seq_to_tap);
>> +			   "fast re-transmit, seqno %u -> %u, win_upper: %u",
>> +			   conn->seq_to_tap, max_ack_seq,
> I'm not sure if "->" really conveys the meaning of "we're sending this
> sequence *because* of that acknowledgement number".
It really means "we are rewinding seq_to_tap from X to Y".  That it is 
caused by a
duplicate ack is implicit.
> I would rather keep
> the received acknowledged sequence before everything else, because
> that's the causal trigger for the retransmission.
>
>> +			   conn->seq_wup_from_tap);
>>   
>>   		conn->seq_to_tap = max_ack_seq;
>>   		tcp_set_peek_offset(conn->sock, 0);
>> @@ -2931,6 +2940,9 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>>   			flow_dbg(conn, "activity timeout");
>>   			tcp_rst(c, conn);
>>   		}
>> +		/* No data sent recently? Keep connection alive. */
>> +		if (conn->seq_to_tap == conn->seq_ack_from_tap)
>> +			tcp_send_flag(c, conn, ACK_IF_NEEDED);
> If the window is zero, this won't send anything, see the first
> condition in tcp_send_flag(). ACK_IF_NEEDED implies that that function
> should queue an ACK segment if we have data to acknowledge.
Ok. I missed that.
> Here, the flag you want is simply 'ACK'. But we should make sure that
> this can't be taken as a duplicate ACK, that is, we should only send
> this if seq_ack_to_tap == seq_from_tap.
>
> Otherwise, we shouldn't send anything, lest the peer retransmit
> anything that we didn't acknowledge yet.
But then we have no probing... Wasn't that the whole pint of this?
>
>>   	}
>>   }
>>   
>> diff --git a/tcp_conn.h b/tcp_conn.h
>> index d280b22..8ae20ef 100644
>> --- a/tcp_conn.h
>> +++ b/tcp_conn.h
>> @@ -30,6 +30,7 @@
>>    * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
>>    * @seq_to_tap:		Next sequence for packets to tap
>>    * @seq_ack_from_tap:	Last ACK number received from tap
>> + * @seq_wup_from_tap:	Right edge of last non-zero window from tap
> "Right edge" makes much more sense to me, and it also matches RFC
> language. Could we turn all the "wup" and "upper" references into
> something like "edge" or "right_edge"?
I tried to come up with something short, because the field name becomes
impractically long. I am open to suggestions.

///jon
>
>>    * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
>>    * @seq_ack_to_tap:	Last ACK number sent to tap
>>    * @seq_init_from_tap:	Initial sequence number from tap
>> @@ -101,6 +102,7 @@ struct tcp_tap_conn {
>>   
>>   	uint32_t	seq_to_tap;
>>   	uint32_t	seq_ack_from_tap;
>> +	uint32_t	seq_wup_from_tap;
>>   	uint32_t	seq_from_tap;
>>   	uint32_t	seq_ack_to_tap;
>>   	uint32_t	seq_init_from_tap;


  reply	other threads:[~2024-05-14 20:19 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-11 15:20 [PATCH v3 0/3] Support for SO_PEEK_OFF socket option Jon Maloy
2024-05-11 15:20 ` [PATCH v3 1/3] tcp: move seq_to_tap update to when frame is queued Jon Maloy
2024-05-13  2:09   ` David Gibson
2024-05-14 16:48   ` Stefano Brivio
2024-05-11 15:20 ` [PATCH v3 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-05-13  2:23   ` David Gibson
2024-05-14 17:22   ` Stefano Brivio
2024-05-14 20:06     ` Jon Maloy
2024-05-14 21:00       ` Stefano Brivio
2024-05-11 15:20 ` [PATCH v3 3/3] tcp: allow retransmit when peer receive window is zero Jon Maloy
2024-05-14 17:46   ` Stefano Brivio
2024-05-14 20:19     ` Jon Maloy [this message]
2024-05-14 21:09       ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b20434e0-135f-4a67-2601-11ef0933bf02@redhat.com \
    --to=jmaloy@redhat.com \
    --cc=dgibson@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).