public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy <jmaloy@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>,
	passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH 2/2] tcp: allow retransmit when peer receive window is zero
Date: Wed, 24 Apr 2024 20:31:36 +0200	[thread overview]
Message-ID: <20240424203136.3c142249@elisabeth> (raw)
In-Reply-To: <Zihas53XonHtyLI0@zatzit>

On top of David's comments:

On Wed, 24 Apr 2024 11:04:51 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Sat, Apr 20, 2024 at 03:19:20PM -0400, Jon Maloy wrote:
> > A bug in kernel TCP may lead to a deadlock where a zero window is sent
> > from the peer, while buffer reads doesn't lead to it being updated.
> > At the same time, the zero window stops this side from sending out
> > more data to trigger new advertisements to be sent from the peer.  
> 
> This is a bit confusing to me.  I guess the buffer reads must be
> happening on the peer, not "this side", but that's not very clear.  I
> think "received from the peer" might be better than "sent from the
> peer" to make it more obvious that the point of view is uniformly from
> "this side".
> 
> > RFC 793 states that it always is permitted for a sender to send one byte
> > of data even when the window is zero.

As incredible as it might sound, RFC 793 is now obsolete :) -- you
should refer to RFC 9293 instead.

Further, RFC 9293 3.8.6.1 (just like RFC 793) says that we *must*
(albeit not MUST) "regularly" send that octet -- as long as we have one
available. Not that it's permitted.

Hence, a general comment on this patch: my understanding is that we
want to implement zero-window probing, instead of triggering a fast
re-transmit whenever we get a keep-alive packet from the peer, because
we don't know if we'll get one.

RFC 9293 3.8.6.1 goes on saying:

  The transmitting host SHOULD send the first zero-window probe when a
  zero window has existed for the retransmission timeout period (SHLD-29)
  (Section 3.8.1), and SHOULD increase exponentially the interval between
  successive probes (SHLD-30).

but we currently don't compute the retransmission timeout according to
RFC 6298 (yeah, I know...).

Given that it's a SHOULD, and complying with that is clearly beyond the
scope of this change, we can just use one of the existing timeouts and
timers (ACK_TIMEOUT would do, for the moment).

> > This resolves the deadlock described
> > above, so we choose to introduce it here as a last resort. We allow it both
> > during fast and as keep-alives when the timer sees no activity on
> > the connection.  
> 
> Looks like there's a missing noun after "during fast".
> 
> > However, we notice that this solution doesn´t work well. Traffic sometimes
> > goes to zero, and onley recovers after the timer has resolved the situation.  
> 
> s/onley/only/
> 
> > Because of this, we chose to improve it slightly: The deadlock happens when a  
> 
> Is it actually useful to describe the "bad" workaround if we're using
> a different one?  I don't see how the better workaround is an obvious
> extension of the bad one.
> 
> > packet has been dropped at the peer end because of memory squeeze. We therefore
> > consider it legitimate to retransmit that packet while considering the window
> > size that was valid at the moment it was first transmitted. This works
> > much better.
> > 
> > It should be noted that although this solves the problem we have at hand,
> > it is not a genuine solution to the kernel bug. There may well be TCP stacks
> > around in other OS-es which don't do this probing.
> > 
> > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> > ---
> >  tcp.c      | 26 ++++++++++++++++----------
> >  tcp_conn.h |  2 ++
> >  2 files changed, 18 insertions(+), 10 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 95d400a..9dea151 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1774,6 +1774,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
> >  
> >  	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
> > +	conn->max_seq_to_tap = conn->seq_to_tap;
> >  }
> >  
> >  /**
> > @@ -2123,9 +2124,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >   *
> >   * #syscalls recvmsg
> >   */
> > -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > +static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn, uint32_t wnd_scaled)

This exceeds 80 columns for no particularly good reason, and the
function comment should be updated.

> >  {
> > -	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> >  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> >  	int sendlen, len, plen, v4 = CONN_V4(conn);
> >  	int s = conn->sock, i, ret = 0;
> > @@ -2212,6 +2212,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  
> >  		return 0;
> >  	}
> > +	sendlen = len;
> > +	if (!peek_offset_cap)
> > +		sendlen -= already_sent;
> >  
> >  	sendlen = len;
> >  	if (!peek_offset_cap)  
> 
> Looks like some duplicated lines from a bad rebase.
> 
> > @@ -2241,7 +2244,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  		tcp_data_to_tap(c, conn, plen, no_csum, seq);
> >  		seq += plen;
> >  	}
> > -
> > +	/* We need this to know this during retransmission: */

s/this to know this/to know this/ (I guess)

> > +	if (SEQ_GT(seq, conn->max_seq_to_tap))
> > +		conn->max_seq_to_tap = seq;
> >  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> >  
> >  	return 0;
> > @@ -2317,8 +2322,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >  			    SEQ_GE(ack_seq, max_ack_seq)) {
> >  				/* Fast re-transmit */
> >  				retr = !len && !th->fin &&
> > -				       ack_seq == max_ack_seq &&
> > -				       ntohs(th->window) == max_ack_seq_wnd;
> > +				       ack_seq == max_ack_seq;

This matches any keep-alive (that is, a segment without data and
without a window update) we get: I think you should replace the
existing condition with a check that the window is zero, instead.

> >  
> >  				max_ack_seq_wnd = ntohs(th->window);
> >  				max_ack_seq = ack_seq;
> > @@ -2385,9 +2389,10 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >  		flow_trace(conn,
> >  			   "fast re-transmit, ACK: %u, previous sequence: %u",
> >  			   max_ack_seq, conn->seq_to_tap);
> > +

Spurious change.

> >  		conn->seq_ack_from_tap = max_ack_seq;
> >  		conn->seq_to_tap = max_ack_seq;
> > -		tcp_data_from_sock(c, conn);
> > +		tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ack_from_tap));  
> 
> Does the MAX do anything: if max_seq_to_tap equals seq_ack_from_tap
> then, by definition, there is nothing to retransmit - everything has
> been acked.
> 
> >  	}
> >  
> >  	if (!iov_i)
> > @@ -2483,7 +2488,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> >  	/* The client might have sent data already, which we didn't
> >  	 * dequeue waiting for SYN,ACK from tap -- check now.
> >  	 */
> > -	tcp_data_from_sock(c, conn);
> > +	tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
> >  	tcp_send_flag(c, conn, ACK);
> >  }
> >  
> > @@ -2575,7 +2580,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
> >  
> >  		tcp_tap_window_update(conn, ntohs(th->window));
> >  
> > -		tcp_data_from_sock(c, conn);
> > +		tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
> >  
> >  		if (p->count - idx == 1)
> >  			return 1;
> > @@ -2788,7 +2793,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >  			flow_dbg(conn, "ACK timeout, retry");
> >  			conn->retrans++;
> >  			conn->seq_to_tap = conn->seq_ack_from_tap;
> > -			tcp_data_from_sock(c, conn);
> > +			tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ack_from_tap));
> >  			tcp_timer_ctl(c, conn);
> >  		}
> >  	} else {
> > @@ -2807,6 +2812,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >  			tcp_rst(c, conn);
> >  		}
> >  	}
> > +  
> 
> Spurious change
> 
> >  }
> >  
> >  /**
> > @@ -2843,7 +2849,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
> >  			conn_event(c, conn, SOCK_FIN_RCVD);
> >  
> >  		if (events & EPOLLIN)
> > -			tcp_data_from_sock(c, conn);
> > +			tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
> >  
> >  		if (events & EPOLLOUT)
> >  			tcp_update_seqack_wnd(c, conn, 0, NULL);
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index a5f5cfe..afcdec9 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -29,6 +29,7 @@
> >   * @wnd_from_tap:	Last window size from tap, unscaled (as received)
> >   * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
> >   * @seq_to_tap:		Next sequence for packets to tap
> > + * @max_seq_to_tap:	Next seq after highest ever sent. Needeed during retransmit

Sequence numbers regularly wrap around (they are 32 bits), so you can't
really define this. I'm not entirely sure how you use it, though -- I
have the same question about the usage of MAX() above.

> >   * @seq_ack_from_tap:	Last ACK number received from tap
> >   * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
> >   * @seq_ack_to_tap:	Last ACK number sent to tap
> > @@ -100,6 +101,7 @@ struct tcp_tap_conn {
> >  	uint16_t	wnd_to_tap;
> >  
> >  	uint32_t	seq_to_tap;
> > +	uint32_t	max_seq_to_tap;
> >  	uint32_t	seq_ack_from_tap;
> >  	uint32_t	seq_from_tap;
> >  	uint32_t	seq_ack_to_tap;  
> 

-- 
Stefano


      reply	other threads:[~2024-04-24 18:32 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-20 19:19 [PATCH 0/2] Support for SO_PEEK_OFF when a available Jon Maloy
2024-04-20 19:19 ` [PATCH 1/2] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-04-23 17:50   ` Stefano Brivio
2024-04-24  0:48     ` David Gibson
2024-04-24 18:30       ` Stefano Brivio
2024-04-26  3:27         ` David Gibson
2024-04-26  5:58           ` Stefano Brivio
2024-04-29  1:46             ` David Gibson
2024-04-25 23:06       ` Jon Maloy
2024-04-24  0:44   ` David Gibson
2024-04-25 23:23     ` Jon Maloy
2024-04-26  3:29       ` David Gibson
2024-04-20 19:19 ` [PATCH 2/2] tcp: allow retransmit when peer receive window is zero Jon Maloy
2024-04-24  1:04   ` David Gibson
2024-04-24 18:31     ` Stefano Brivio [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240424203136.3c142249@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=dgibson@redhat.com \
    --cc=jmaloy@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).