public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Jon Maloy <jmaloy@redhat.com>
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
	dgibson@redhat.com, jmaloy@redhat.com
Subject: [PATCH v2 2/2] tcp: allow retransmit when peer receive window is zero
Date: Wed,  1 May 2024 16:28:39 -0400	[thread overview]
Message-ID: <20240501202839.3491885-3-jmaloy@redhat.com> (raw)
In-Reply-To: <20240501202839.3491885-1-jmaloy@redhat.com>

A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the peer, while he is unable to send out window updates even after
reads have freed up enough buffer space to permit a larger window.
In this situation, new window advertisemnts from the peer can only be
triggered by packets arriving from this side.

However, such packets are never sent, because the zero-window condition
currently prevents this side from sending out any packets whatsoever
to the peer.

We notice that the above bug is triggered *only* after the peer has
dropped an arriving packet because of severe memory squeeze, and that we
hence always enter a retransmission situation when this occurs. This
also means that RFC 9293 is violated, since the zero-advertisement
shrinks the previously advertised window within which the dropped packet
originally was sent.

RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
the following statement:
"A TCP receiver SHOULD NOT shrink the window, i.e., move the right
window edge to the left (SHLD-14). However, a sending TCP peer MUST
be robust against window shrinking, which may cause the
"usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).

If this happens, the sender SHOULD NOT send new data (SHLD-15), but
SHOULD retransmit normally the old unacknowledged data between SND.UNA
and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
beyond SND.UNA+SND.WND (MAY-7)"

It should be noted that although this solves the problem we have at
hand, it is not a genuine solution to the kernel bug. There may well
be TCP stacks around in other OS-es which don't do this, nor have
keep-alive probing as an alternatve way to solve the situation.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>

---
v2: - Using previously advertised window during retransmission, instead
      highest send sequencece number in the cycle.
---
 tcp.c      | 29 ++++++++++++++++++++---------
 tcp_conn.h |  2 ++
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/tcp.c b/tcp.c
index 535f1a2..703c62f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1749,9 +1749,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
  */
 static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 {
+	uint32_t wnd_upper;
+
 	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
 	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
 
+	wnd_upper = conn->seq_ack_from_tap + wnd;
+	if (wnd && SEQ_GT(wnd_upper, conn->seq_wup_from_tap))
+		conn->seq_wup_from_tap = wnd_upper;
+
 	/* FIXME: reflect the tap-side receiver's window back to the sock-side
 	 * sender by adjusting SO_RCVBUF? */
 }
@@ -1784,6 +1790,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
 	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+	conn->seq_wup_from_tap = conn->seq_to_tap;
 }
 
 /**
@@ -2133,9 +2140,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
  *
  * #syscalls recvmsg
  */
-static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn, uint32_t wnd_scaled)
 {
-	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
 	int sendlen, len, plen, v4 = CONN_V4(conn);
 	int s = conn->sock, i, ret = 0;
@@ -2282,6 +2288,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 	uint32_t max_ack_seq = conn->seq_ack_from_tap;
 	uint32_t seq_from_tap = conn->seq_from_tap;
 	struct msghdr mh = { .msg_iov = tcp_iov };
+	uint32_t wnd;
 	size_t len;
 	ssize_t n;
 
@@ -2325,8 +2332,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			    SEQ_GE(ack_seq, max_ack_seq)) {
 				/* Fast re-transmit */
 				retr = !len && !th->fin &&
-				       ack_seq == max_ack_seq &&
-				       ntohs(th->window) == max_ack_seq_wnd;
+				       ack_seq == max_ack_seq;
 
 				max_ack_seq_wnd = ntohs(th->window);
 				max_ack_seq = ack_seq;
@@ -2400,7 +2406,8 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		conn->seq_ack_from_tap = max_ack_seq;
 		conn->seq_to_tap = max_ack_seq;
 		set_peek_offset(conn, 0);
-		tcp_data_from_sock(c, conn);
+		wnd = conn->seq_wup_from_tap - max_ack_seq;
+		tcp_data_from_sock(c, conn, wnd);
 
 		/* Empty queue before any POLL event tries to send it again */
 		tcp_l2_data_buf_flush(c);
@@ -2500,7 +2507,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
 	/* The client might have sent data already, which we didn't
 	 * dequeue waiting for SYN,ACK from tap -- check now.
 	 */
-	tcp_data_from_sock(c, conn);
+	tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
 	tcp_send_flag(c, conn, ACK);
 }
 
@@ -2593,7 +2600,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 
 		tcp_tap_window_update(conn, ntohs(th->window));
 
-		tcp_data_from_sock(c, conn);
+		tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
 
 		if (p->count - idx == 1)
 			return 1;
@@ -2776,6 +2783,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 {
 	struct itimerspec check_armed = { { 0 }, { 0 } };
 	struct tcp_tap_conn *conn = CONN(ref.flow);
+	uint32_t wnd;
 
 	if (c->no_tcp)
 		return;
@@ -2807,7 +2815,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 			conn->seq_to_tap = conn->seq_ack_from_tap;
 			set_peek_offset(conn, 0);
 			tcp_l2_data_buf_flush(c);
-			tcp_data_from_sock(c, conn);
+
+			/* RFC 9293, 3.8.6: Send 1 byte of new data if needed */
+			wnd = conn->seq_wup_from_tap - conn->seq_ack_from_tap;
+			tcp_data_from_sock(c, conn, MAX(1, wnd));
 			tcp_l2_data_buf_flush(c);
 			tcp_timer_ctl(c, conn);
 		}
@@ -2863,7 +2874,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
 			conn_event(c, conn, SOCK_FIN_RCVD);
 
 		if (events & EPOLLIN)
-			tcp_data_from_sock(c, conn);
+			tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
 
 		if (events & EPOLLOUT)
 			tcp_update_seqack_wnd(c, conn, 0, NULL);
diff --git a/tcp_conn.h b/tcp_conn.h
index a5f5cfe..d6b627a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -30,6 +30,7 @@
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
  * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_wup_from_tap:	Right edge (+1) of last non-zero window from tap
  * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
  * @seq_ack_to_tap:	Last ACK number sent to tap
  * @seq_init_from_tap:	Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {
 
 	uint32_t	seq_to_tap;
 	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_wup_from_tap;
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;
-- 
@@ -30,6 +30,7 @@
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
  * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_wup_from_tap:	Right edge (+1) of last non-zero window from tap
  * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
  * @seq_ack_to_tap:	Last ACK number sent to tap
  * @seq_init_from_tap:	Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {
 
 	uint32_t	seq_to_tap;
 	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_wup_from_tap;
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;
-- 
2.42.0


  parent reply	other threads:[~2024-05-01 20:28 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-01 20:28 [PATCH v2 0/2] SO_PEEK_OFF support Jon Maloy
2024-05-01 20:28 ` [PATCH v2 1/2] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-05-02  1:31   ` David Gibson
2024-05-03 13:42     ` Stefano Brivio
2024-05-03 14:43       ` Jon Maloy
2024-05-06  7:15         ` David Gibson
2024-05-06  6:51       ` David Gibson
2024-05-01 20:28 ` Jon Maloy [this message]
2024-05-03 13:43   ` [PATCH v2 2/2] tcp: allow retransmit when peer receive window is zero Stefano Brivio
2024-05-03 15:30     ` Jon Maloy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240501202839.3491885-3-jmaloy@redhat.com \
    --to=jmaloy@redhat.com \
    --cc=dgibson@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).