From: Jon Maloy <jmaloy@redhat.com>
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
dgibson@redhat.com, jmaloy@redhat.com
Subject: [PATCH v6 3/3] tcp: allow retransmit when peer receive window is zero
Date: Fri, 17 May 2024 11:06:02 -0400 [thread overview]
Message-ID: <20240517150602.1186203-4-jmaloy@redhat.com> (raw)
In-Reply-To: <20240517150602.1186203-1-jmaloy@redhat.com>
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the peer, while it is unable to send out window updates even after
reads have freed up enough buffer space to permit a larger window.
In this situation, new window advertisemnts from the peer can only be
triggered by packets arriving from this side.
However, such packets are never sent, because the zero-window condition
currently prevents this side from sending out any packets whatsoever
to the peer.
We notice that the above bug is triggered *only* after the peer has
dropped an arriving packet because of severe memory squeeze, and that we
hence always enter a retransmission situation when this occurs. This
also means that it goes against the RFC 9293 recommendation that a
previously advertised window never should shrink.
RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
the following statement:
"A TCP receiver SHOULD NOT shrink the window, i.e., move the right
window edge to the left (SHLD-14). However, a sending TCP peer MUST
be robust against window shrinking, which may cause the
"usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
If this happens, the sender SHOULD NOT send new data (SHLD-15), but
SHOULD retransmit normally the old unacknowledged data between SND.UNA
and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
beyond SND.UNA+SND.WND (MAY-7)"
We never see the window become negative, but we interpret this as a
recommendation to use the previously available window during
retransmission even when the currently advertised window is zero.
We use the above mechanism only for timer-induced retransmits, while
the fast-retransmit mechanism won't trigger on this condition.
It should be noted that although this solves the problem we have at
hand, it is not a genuine solution to the kernel bug. There may well
be TCP stacks around in other OS-es which don't do this, nor have
keep-alive probing as an alternatve way to solve the situation.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v2: - Using previously advertised window during retransmission, instead
highest send sequencece number in the cycle.
v3: - Rebased to newest code
- Changes based on feedback from PASST team
- Sending out empty probe message at timer expiration when
we are not in retransmit situation.
v4: - Some small changes based on feedback from PASST team.
- Replaced fast retransmit with a one-time 'fast probe' when
window is zero.
v5: - Gave up on 'fast probing' for now. When I got the sequence
numbers right in the flag message (after having emptied the tap
queue), it turns out an empty message does *not* force a new peer
window update as was my previous understanding of the code.
- Added cppcheck suppression line (which I was unable to verify)
as suggested by S. Brivio.
- Removed sending of empty probe when window from tap side is zero.
It looks pointless at the moment, at least for solving the above
described situation.
v6: - Ensure that arrival of new data doesn´t cause us to ignore a
zero-window situation.
- Removed the pointless probing referred to in v5 comment.
---
tcp.c | 26 ++++++++++++++++++++------
tcp_conn.h | 2 ++
2 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/tcp.c b/tcp.c
index fa13292..38c3480 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1764,9 +1764,17 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
*/
static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
{
+ uint32_t wnd_edge;
+
wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
+ /* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression] */
+
conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
+ wnd_edge = conn->seq_ack_from_tap + wnd;
+ if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
+ conn->seq_wnd_edge_from_tap = wnd_edge;
+
/* FIXME: reflect the tap-side receiver's window back to the sock-side
* sender by adjusting SO_RCVBUF? */
}
@@ -1799,6 +1807,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+ conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
}
/**
@@ -2208,13 +2217,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
*/
static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
{
- uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
int sendlen, len, dlen, v4 = CONN_V4(conn);
+ uint32_t already_sent, max_send, seq;
int s = conn->sock, i, ret = 0;
struct msghdr mh_sock = { 0 };
uint16_t mss = MSS_GET(conn);
- uint32_t already_sent, seq;
struct iovec *iov;
/* How much have we read/sent since last received ack ? */
@@ -2228,19 +2236,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
tcp_set_peek_offset(s, 0);
}
- if (!wnd_scaled || already_sent >= wnd_scaled) {
+ /* How much are we still allowed to send within current window ? */
+ max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap;
+ if (SEQ_LE(max_send, 0)) {
+ flow_trace(conn, "Window full: right edge: %u, sent: %u",
+ conn->seq_wnd_edge_from_tap, conn->seq_to_tap);
+ conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
conn_flag(c, conn, STALLED);
conn_flag(c, conn, ACK_FROM_TAP_DUE);
return 0;
}
/* Set up buffer descriptors we'll fill completely and partially. */
- fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
+ fill_bufs = DIV_ROUND_UP(max_send, mss);
if (fill_bufs > TCP_FRAMES) {
fill_bufs = TCP_FRAMES;
iov_rem = 0;
} else {
- iov_rem = (wnd_scaled - already_sent) % mss;
+ iov_rem = max_send % mss;
}
/* Prepare iov according to kernel capability */
@@ -2347,6 +2360,7 @@ err:
*
* Return: count of consumed packets
*/
+
static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
const struct pool *p, int idx)
{
@@ -2950,7 +2964,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
if (events & (EPOLLRDHUP | EPOLLHUP))
conn_event(c, conn, SOCK_FIN_RCVD);
- if (events & EPOLLIN)
+ if (events & EPOLLIN && conn->wnd_from_tap)
tcp_data_from_sock(c, conn);
if (events & EPOLLOUT)
diff --git a/tcp_conn.h b/tcp_conn.h
index d280b22..5cbad2a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -30,6 +30,7 @@
* @wnd_to_tap: Sending window advertised to tap, unscaled (as sent)
* @seq_to_tap: Next sequence for packets to tap
* @seq_ack_from_tap: Last ACK number received from tap
+ * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap
* @seq_from_tap: Next sequence for packets from tap (not actually sent)
* @seq_ack_to_tap: Last ACK number sent to tap
* @seq_init_from_tap: Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {
uint32_t seq_to_tap;
uint32_t seq_ack_from_tap;
+ uint32_t seq_wnd_edge_from_tap;
uint32_t seq_from_tap;
uint32_t seq_ack_to_tap;
uint32_t seq_init_from_tap;
--
@@ -30,6 +30,7 @@
* @wnd_to_tap: Sending window advertised to tap, unscaled (as sent)
* @seq_to_tap: Next sequence for packets to tap
* @seq_ack_from_tap: Last ACK number received from tap
+ * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap
* @seq_from_tap: Next sequence for packets from tap (not actually sent)
* @seq_ack_to_tap: Last ACK number sent to tap
* @seq_init_from_tap: Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {
uint32_t seq_to_tap;
uint32_t seq_ack_from_tap;
+ uint32_t seq_wnd_edge_from_tap;
uint32_t seq_from_tap;
uint32_t seq_ack_to_tap;
uint32_t seq_init_from_tap;
--
2.42.0
next prev parent reply other threads:[~2024-05-17 15:06 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-17 15:05 [PATCH v6 0/3] Support for SO_PEEK_OFF socket option Jon Maloy
2024-05-17 15:06 ` [PATCH v6 1/3] tcp: move seq_to_tap update to when frame is queued Jon Maloy
2024-05-17 15:06 ` [PATCH v6 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Jon Maloy
2024-05-17 15:06 ` Jon Maloy [this message]
2024-05-17 15:24 [PATCH v6 0/3] Support for SO_PEEK_OFF socket option Jon Maloy
2024-05-17 15:24 ` [PATCH v6 3/3] tcp: allow retransmit when peer receive window is zero Jon Maloy
2024-05-21 5:51 ` David Gibson
2024-05-21 22:25 ` Jon Maloy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240517150602.1186203-4-jmaloy@redhat.com \
--to=jmaloy@redhat.com \
--cc=dgibson@redhat.com \
--cc=lvivier@redhat.com \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).