public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Jon Maloy <jmaloy@redhat.com>
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
	dgibson@redhat.com, jmaloy@redhat.com
Subject: [PATCH v3 3/3] tcp: allow retransmit when peer receive window is zero
Date: Sat, 11 May 2024 11:20:08 -0400	[thread overview]
Message-ID: <20240511152008.421750-4-jmaloy@redhat.com> (raw)
In-Reply-To: <20240511152008.421750-1-jmaloy@redhat.com>

A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the peer, while it is unable to send out window updates even after
reads have freed up enough buffer space to permit a larger window.
In this situation, new window advertisemnts from the peer can only be
triggered by packets arriving from this side.

However, such packets are never sent, because the zero-window condition
currently prevents this side from sending out any packets whatsoever
to the peer.

We notice that the above bug is triggered *only* after the peer has
dropped an arriving packet because of severe memory squeeze, and that we
hence always enter a retransmission situation when this occurs. This
also means that it goes against the RFC 9293 recommendation that a
previously advertised window never should shrink.

RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
the following statement:
"A TCP receiver SHOULD NOT shrink the window, i.e., move the right
window edge to the left (SHLD-14). However, a sending TCP peer MUST
be robust against window shrinking, which may cause the
"usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).

If this happens, the sender SHOULD NOT send new data (SHLD-15), but
SHOULD retransmit normally the old unacknowledged data between SND.UNA
and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
beyond SND.UNA+SND.WND (MAY-7)"

We never see the window become negative, but we interpret this as a
recommendation to use the previously available window during
retransmission even when the currently advertised window is zero.

In case of a zero-window non-retransmission situation where there
is no new data to be sent, we also add a simple zero-window probing
feature. By sending an empty packet at regular timeout events we
resolve the situation described above, since the peer receives the
necessary trigger to advertise its window once it becomes non-zero
again.

It should be noted that although this solves the problem we have at
hand, it is not a genuine solution to the kernel bug. There may well
be TCP stacks around in other OS-es which don't do this, nor have
keep-alive probing as an alternatve way to solve the situation.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>

---
v2: - Using previously advertised window during retransmission, instead
      highest send sequencece number in the cycle.
v3: - Rebased to newest code
    - Changes based on feedback from PASST team
    - Sending out empty probe message at timer expiration when
      we are not in retransmit situation.
---
 tcp.c      | 30 +++++++++++++++++++++---------
 tcp_conn.h |  2 ++
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/tcp.c b/tcp.c
index 8297812..bd6bf35 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1774,9 +1774,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
  */
 static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 {
+	uint32_t wnd_upper;
+
 	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
 	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
 
+	wnd_upper = conn->seq_ack_from_tap + wnd;
+	if (wnd && SEQ_GT(wnd_upper, conn->seq_wup_from_tap))
+		conn->seq_wup_from_tap = wnd_upper;
+
 	/* FIXME: reflect the tap-side receiver's window back to the sock-side
 	 * sender by adjusting SO_RCVBUF? */
 }
@@ -1809,6 +1815,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
 	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+	conn->seq_wup_from_tap = conn->seq_to_tap;
 }
 
 /**
@@ -2220,7 +2227,6 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 {
-	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
 	int sendlen, len, dlen, v4 = CONN_V4(conn);
 	uint32_t max_send, seq, already_sent;
@@ -2241,10 +2247,11 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	/* How much are we still allowed to send within current window ? */
-	max_send = conn->seq_ack_from_tap + wnd_scaled - conn->seq_to_tap;
+	max_send = conn->seq_wup_from_tap - conn->seq_to_tap;
 	if (SEQ_LE(max_send, 0)) {
-		flow_trace(conn, "Empty window: win: %u, sent: %u",
-			   wnd_scaled, conn->seq_to_tap);
+		flow_trace(conn, "Empty window: win_upper: %u, sent: %u",
+			   conn->seq_wup_from_tap, conn->seq_to_tap);
+		conn->seq_wup_from_tap = conn->seq_to_tap;
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
@@ -2380,7 +2387,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 	ASSERT(conn->events & ESTABLISHED);
 
 	for (i = idx, iov_i = 0; i < (int)p->count; i++) {
-		uint32_t seq, seq_offset, ack_seq;
+		uint32_t seq, seq_offset, ack_seq, wnd;
 		const struct tcphdr *th;
 		char *data;
 		size_t off;
@@ -2413,11 +2420,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) &&
 			    SEQ_GE(ack_seq, max_ack_seq)) {
 				/* Fast re-transmit */
+				wnd = ntohs(th->window);
 				retr = !len && !th->fin &&
 				       ack_seq == max_ack_seq &&
-				       ntohs(th->window) == max_ack_seq_wnd;
+				       (wnd == max_ack_seq_wnd || !wnd);
 
-				max_ack_seq_wnd = ntohs(th->window);
+				max_ack_seq_wnd = wnd;
 				max_ack_seq = ack_seq;
 			}
 		}
@@ -2480,8 +2488,9 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 
 	if (retr) {
 		flow_trace(conn,
-			   "fast re-transmit, ACK: %u, previous sequence: %u",
-			   max_ack_seq, conn->seq_to_tap);
+			   "fast re-transmit, seqno %u -> %u, win_upper: %u",
+			   conn->seq_to_tap, max_ack_seq,
+			   conn->seq_wup_from_tap);
 
 		conn->seq_to_tap = max_ack_seq;
 		tcp_set_peek_offset(conn->sock, 0);
@@ -2931,6 +2940,9 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 			flow_dbg(conn, "activity timeout");
 			tcp_rst(c, conn);
 		}
+		/* No data sent recently? Keep connection alive. */
+		if (conn->seq_to_tap == conn->seq_ack_from_tap)
+			tcp_send_flag(c, conn, ACK_IF_NEEDED);
 	}
 }
 
diff --git a/tcp_conn.h b/tcp_conn.h
index d280b22..8ae20ef 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -30,6 +30,7 @@
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
  * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_wup_from_tap:	Right edge of last non-zero window from tap
  * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
  * @seq_ack_to_tap:	Last ACK number sent to tap
  * @seq_init_from_tap:	Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {
 
 	uint32_t	seq_to_tap;
 	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_wup_from_tap;
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;
-- 
@@ -30,6 +30,7 @@
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
  * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_wup_from_tap:	Right edge of last non-zero window from tap
  * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
  * @seq_ack_to_tap:	Last ACK number sent to tap
  * @seq_init_from_tap:	Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {
 
 	uint32_t	seq_to_tap;
 	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_wup_from_tap;
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;
-- 
2.42.0


  parent reply	other threads:[~2024-05-11 15:20 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-11 15:20 [PATCH v3 0/3] Support for SO_PEEK_OFF socket option Jon Maloy
2024-05-11 15:20 ` [PATCH v3 1/3] tcp: move seq_to_tap update to when frame is queued Jon Maloy
2024-05-13  2:09   ` David Gibson
2024-05-14 16:48   ` Stefano Brivio
2024-05-11 15:20 ` [PATCH v3 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-05-13  2:23   ` David Gibson
2024-05-14 17:22   ` Stefano Brivio
2024-05-14 20:06     ` Jon Maloy
2024-05-14 21:00       ` Stefano Brivio
2024-05-11 15:20 ` Jon Maloy [this message]
2024-05-14 17:46   ` [PATCH v3 3/3] tcp: allow retransmit when peer receive window is zero Stefano Brivio
2024-05-14 20:19     ` Jon Maloy
2024-05-14 21:09       ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240511152008.421750-4-jmaloy@redhat.com \
    --to=jmaloy@redhat.com \
    --cc=dgibson@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).