public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Jon Maloy <jmaloy@redhat.com>
Cc: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
	dgibson@redhat.com
Subject: Re: [PATCH 2/2] tcp: allow retransmit when peer receive window is zero
Date: Wed, 24 Apr 2024 11:04:51 +1000	[thread overview]
Message-ID: <Zihas53XonHtyLI0@zatzit> (raw)
In-Reply-To: <20240420191920.104876-3-jmaloy@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 7413 bytes --]

On Sat, Apr 20, 2024 at 03:19:20PM -0400, Jon Maloy wrote:
> A bug in kernel TCP may lead to a deadlock where a zero window is sent
> from the peer, while buffer reads doesn't lead to it being updated.
> At the same time, the zero window stops this side from sending out
> more data to trigger new advertisements to be sent from the peer.

This is a bit confusing to me.  I guess the buffer reads must be
happening on the peer, not "this side", but that's not very clear.  I
think "received from the peer" might be better than "sent from the
peer" to make it more obvious that the point of view is uniformly from
"this side".

> RFC 793 states that it always is permitted for a sender to send one byte
> of data even when the window is zero. This resolves the deadlock described
> above, so we choose to introduce it here as a last resort. We allow it both
> during fast and as keep-alives when the timer sees no activity on
> the connection.

Looks like there's a missing noun after "during fast".

> However, we notice that this solution doesn´t work well. Traffic sometimes
> goes to zero, and onley recovers after the timer has resolved the situation.

s/onley/only/

> Because of this, we chose to improve it slightly: The deadlock happens when a

Is it actually useful to describe the "bad" workaround if we're using
a different one?  I don't see how the better workaround is an obvious
extension of the bad one.

> packet has been dropped at the peer end because of memory squeeze. We therefore
> consider it legitimate to retransmit that packet while considering the window
> size that was valid at the moment it was first transmitted. This works
> much better.
> 
> It should be noted that although this solves the problem we have at hand,
> it is not a genuine solution to the kernel bug. There may well be TCP stacks
> around in other OS-es which don't do this probing.
> 
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>  tcp.c      | 26 ++++++++++++++++----------
>  tcp_conn.h |  2 ++
>  2 files changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 95d400a..9dea151 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1774,6 +1774,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
>  	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
>  
>  	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
> +	conn->max_seq_to_tap = conn->seq_to_tap;
>  }
>  
>  /**
> @@ -2123,9 +2124,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>   *
>   * #syscalls recvmsg
>   */
> -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> +static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn, uint32_t wnd_scaled)
>  {
> -	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
>  	int sendlen, len, plen, v4 = CONN_V4(conn);
>  	int s = conn->sock, i, ret = 0;
> @@ -2212,6 +2212,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  
>  		return 0;
>  	}
> +	sendlen = len;
> +	if (!peek_offset_cap)
> +		sendlen -= already_sent;
>  
>  	sendlen = len;
>  	if (!peek_offset_cap)

Looks like some duplicated lines from a bad rebase.

> @@ -2241,7 +2244,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  		tcp_data_to_tap(c, conn, plen, no_csum, seq);
>  		seq += plen;
>  	}
> -
> +	/* We need this to know this during retransmission: */
> +	if (SEQ_GT(seq, conn->max_seq_to_tap))
> +		conn->max_seq_to_tap = seq;
>  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  
>  	return 0;
> @@ -2317,8 +2322,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			    SEQ_GE(ack_seq, max_ack_seq)) {
>  				/* Fast re-transmit */
>  				retr = !len && !th->fin &&
> -				       ack_seq == max_ack_seq &&
> -				       ntohs(th->window) == max_ack_seq_wnd;
> +				       ack_seq == max_ack_seq;
>  
>  				max_ack_seq_wnd = ntohs(th->window);
>  				max_ack_seq = ack_seq;
> @@ -2385,9 +2389,10 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  		flow_trace(conn,
>  			   "fast re-transmit, ACK: %u, previous sequence: %u",
>  			   max_ack_seq, conn->seq_to_tap);
> +
>  		conn->seq_ack_from_tap = max_ack_seq;
>  		conn->seq_to_tap = max_ack_seq;
> -		tcp_data_from_sock(c, conn);
> +		tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ack_from_tap));

Does the MAX do anything: if max_seq_to_tap equals seq_ack_from_tap
then, by definition, there is nothing to retransmit - everything has
been acked.

>  	}
>  
>  	if (!iov_i)
> @@ -2483,7 +2488,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>  	/* The client might have sent data already, which we didn't
>  	 * dequeue waiting for SYN,ACK from tap -- check now.
>  	 */
> -	tcp_data_from_sock(c, conn);
> +	tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
>  	tcp_send_flag(c, conn, ACK);
>  }
>  
> @@ -2575,7 +2580,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
>  
>  		tcp_tap_window_update(conn, ntohs(th->window));
>  
> -		tcp_data_from_sock(c, conn);
> +		tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
>  
>  		if (p->count - idx == 1)
>  			return 1;
> @@ -2788,7 +2793,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  			flow_dbg(conn, "ACK timeout, retry");
>  			conn->retrans++;
>  			conn->seq_to_tap = conn->seq_ack_from_tap;
> -			tcp_data_from_sock(c, conn);
> +			tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ack_from_tap));
>  			tcp_timer_ctl(c, conn);
>  		}
>  	} else {
> @@ -2807,6 +2812,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  			tcp_rst(c, conn);
>  		}
>  	}
> +

Spurious change

>  }
>  
>  /**
> @@ -2843,7 +2849,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
>  			conn_event(c, conn, SOCK_FIN_RCVD);
>  
>  		if (events & EPOLLIN)
> -			tcp_data_from_sock(c, conn);
> +			tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
>  
>  		if (events & EPOLLOUT)
>  			tcp_update_seqack_wnd(c, conn, 0, NULL);
> diff --git a/tcp_conn.h b/tcp_conn.h
> index a5f5cfe..afcdec9 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -29,6 +29,7 @@
>   * @wnd_from_tap:	Last window size from tap, unscaled (as received)
>   * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
>   * @seq_to_tap:		Next sequence for packets to tap
> + * @max_seq_to_tap:	Next seq after highest ever sent. Needeed during retransmit
>   * @seq_ack_from_tap:	Last ACK number received from tap
>   * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
>   * @seq_ack_to_tap:	Last ACK number sent to tap
> @@ -100,6 +101,7 @@ struct tcp_tap_conn {
>  	uint16_t	wnd_to_tap;
>  
>  	uint32_t	seq_to_tap;
> +	uint32_t	max_seq_to_tap;
>  	uint32_t	seq_ack_from_tap;
>  	uint32_t	seq_from_tap;
>  	uint32_t	seq_ack_to_tap;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2024-04-24  1:04 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-20 19:19 [PATCH 0/2] Support for SO_PEEK_OFF when a available Jon Maloy
2024-04-20 19:19 ` [PATCH 1/2] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-04-23 17:50   ` Stefano Brivio
2024-04-24  0:48     ` David Gibson
2024-04-24 18:30       ` Stefano Brivio
2024-04-26  3:27         ` David Gibson
2024-04-26  5:58           ` Stefano Brivio
2024-04-29  1:46             ` David Gibson
2024-04-25 23:06       ` Jon Maloy
2024-04-24  0:44   ` David Gibson
2024-04-25 23:23     ` Jon Maloy
2024-04-26  3:29       ` David Gibson
2024-04-20 19:19 ` [PATCH 2/2] tcp: allow retransmit when peer receive window is zero Jon Maloy
2024-04-24  1:04   ` David Gibson [this message]
2024-04-24 18:31     ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zihas53XonHtyLI0@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=dgibson@redhat.com \
    --cc=jmaloy@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).