From: Jon Maloy
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com, jmaloy@redhat.com
Subject: [PATCH v7 3/3] tcp: allow retransmit when peer receive window is zero
Date: Fri, 24 May 2024 13:26:56 -0400
Message-ID: <20240524172656.193183-4-jmaloy@redhat.com>
In-Reply-To: <20240524172656.193183-1-jmaloy@redhat.com>
References: 
 <20240524172656.193183-1-jmaloy@redhat.com>
List-Id: Development discussion and patches for passt

A bug in kernel TCP may lead to a deadlock where the peer advertises a
zero window and is then unable to send out window updates, even after
reads have freed up enough buffer space to permit a larger window. In
this situation, new window advertisements from the peer can only be
triggered by data packets arriving from this side.

However, such packets are never sent, because the zero-window condition
currently prevents this side from sending out any packets whatsoever to
the peer.

We notice that the above bug is triggered *only* after the peer has
dropped an arriving packet because of a severe memory squeeze, and that
we hence always enter a retransmission situation when this occurs. This
also means that the peer's zero-window advertisement goes against the
RFC 9293 recommendation that a previously advertised window should never
shrink.

RFC 9293 seems to permit sending up to the right edge of the last
advertised non-zero window in such cases, so that is what we do to
resolve this situation.

However, we use this mechanism only for timer-induced retransmits; the
fast-retransmit mechanism is not affected by this change.

It should be noted that although this solves the problem we have at
hand, it is a work-around, and not a genuine solution to the described
kernel bug.

Signed-off-by: Jon Maloy
---
 tcp.c      | 44 +++++++++++++++++++++++++++++++-------------
 tcp_conn.h |  2 ++
 2 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/tcp.c b/tcp.c
index 01898f1..76df04e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1760,9 +1760,17 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
  */
 static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 {
+	uint32_t wnd_edge;
+
 	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
+
+	/* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression] */
 	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
 
+	wnd_edge = conn->seq_ack_from_tap + wnd;
+	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
+		conn->seq_wnd_edge_from_tap = wnd_edge;
+
 	/* FIXME: reflect the tap-side receiver's window back to the sock-side
 	 * sender by adjusting SO_RCVBUF? */
 }
@@ -1795,6 +1803,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
 	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+	conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
 }
 
 /**
@@ -2201,15 +2210,16 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
  * tcp_data_from_sock() - Handle new data from socket, queue to tap, in window
  * @c:		Execution context
  * @conn:	Connection pointer
+ * @wnd_edge:	Right edge of window advertised from tap
  *
  * Return: negative on connection reset, 0 otherwise
  *
  * #syscalls recvmsg
  */
-static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn,
+			      uint32_t wnd_edge)
 {
-	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
-	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
+	int max_send, fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
 	int sendlen, len, dlen, v4 = CONN_V4(conn);
 	int s = conn->sock, i, ret = 0;
 	struct msghdr mh_sock = { 0 };
@@ -2228,19 +2238,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 		tcp_set_peek_offset(s, 0);
 	}
 
-	if (!wnd_scaled || already_sent >= wnd_scaled) {
+	/* How much can we read/send within current window ? */
+	max_send = wnd_edge - conn->seq_to_tap;
+	if (max_send <= 0) {
+		flow_trace(conn, "Window full: right edge: %u, sent: %u",
+			   wnd_edge, conn->seq_to_tap);
+		conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
 	}
 
 	/* Set up buffer descriptors we'll fill completely and partially. */
-	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
+	fill_bufs = DIV_ROUND_UP(max_send, mss);
 	if (fill_bufs > TCP_FRAMES) {
 		fill_bufs = TCP_FRAMES;
 		iov_rem = 0;
 	} else {
-		iov_rem = (wnd_scaled - already_sent) % mss;
+		iov_rem = max_send % mss;
 	}
 
 	/* Prepare iov according to kernel capability */
@@ -2468,7 +2483,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			   max_ack_seq, conn->seq_to_tap);
 		conn->seq_to_tap = max_ack_seq;
 		tcp_set_peek_offset(conn->sock, 0);
-		tcp_data_from_sock(c, conn);
+		tcp_data_from_sock(c, conn, conn->seq_wnd_edge_from_tap);
 	}
 
 	if (!iov_i)
@@ -2565,7 +2580,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
 	/* The client might have sent data already, which we didn't
 	 * dequeue waiting for SYN,ACK from tap -- check now.
 	 */
-	tcp_data_from_sock(c, conn);
+	tcp_data_from_sock(c, conn, conn->seq_wnd_edge_from_tap);
 	tcp_send_flag(c, conn, ACK);
 }
 
@@ -2658,7 +2673,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 
 	tcp_tap_window_update(conn, ntohs(th->window));
 
-	tcp_data_from_sock(c, conn);
+	tcp_data_from_sock(c, conn, conn->seq_wnd_edge_from_tap);
 
 	if (p->count - idx == 1)
 		return 1;
@@ -2891,7 +2906,8 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 			conn->retrans++;
 			conn->seq_to_tap = conn->seq_ack_from_tap;
 			tcp_set_peek_offset(conn->sock, 0);
-			tcp_data_from_sock(c, conn);
+			tcp_data_from_sock(c, conn,
+					   conn->seq_wnd_edge_from_tap);
 			tcp_timer_ctl(c, conn);
 		}
 	} else {
@@ -2945,9 +2961,11 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
 	if (events & (EPOLLRDHUP | EPOLLHUP))
 		conn_event(c, conn, SOCK_FIN_RCVD);
 
-	if (events & EPOLLIN)
-		tcp_data_from_sock(c, conn);
-
+	if (events & EPOLLIN) {
+		tcp_data_from_sock(c, conn, conn->wnd_from_tap
+				   ? conn->seq_wnd_edge_from_tap
+				   : conn->seq_to_tap);
+	}
 	if (events & EPOLLOUT)
 		tcp_update_seqack_wnd(c, conn, 0, NULL);
 
diff --git a/tcp_conn.h b/tcp_conn.h
index 5f8c8fb..16228d8 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -30,6 +30,7 @@
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
  * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap
  * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
  * @seq_ack_to_tap:	Last ACK number sent to tap
  * @seq_init_from_tap:	Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {
 
 	uint32_t	seq_to_tap;
 	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_wnd_edge_from_tap;
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;
-- 
2.45.0
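
For readers following the thread, here is a minimal standalone sketch of the
right-edge bookkeeping the commit message describes. It is not the passt code:
struct sketch_conn, seq_gt(), sketch_window_update(), sketch_max_send() and
sketch_demo() are hypothetical names made up for illustration, and the
sequence comparison is assumed to be the usual signed-difference idiom rather
than the project's actual SEQ_GT() macro. The numbers in sketch_demo() are
invented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the per-connection state */
struct sketch_conn {
	uint32_t seq_to_tap;		/* next sequence to send to peer */
	uint32_t seq_ack_from_tap;	/* last ACK number from peer */
	uint32_t seq_wnd_edge;		/* right edge of last non-zero window */
};

/* Wraparound-tolerant "a is after b", assuming two's complement conversion */
static int seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Remember the right edge, but never move it for a zero window or backwards */
static void sketch_window_update(struct sketch_conn *conn, uint32_t wnd)
{
	uint32_t edge = conn->seq_ack_from_tap + wnd;

	if (wnd && seq_gt(edge, conn->seq_wnd_edge))
		conn->seq_wnd_edge = edge;
}

/* Bytes that may still be sent up to the remembered edge (<= 0: stalled) */
static int32_t sketch_max_send(const struct sketch_conn *conn)
{
	return (int32_t)(conn->seq_wnd_edge - conn->seq_to_tap);
}

/* Walk through the zero-window case with invented numbers */
static int32_t sketch_demo(void)
{
	struct sketch_conn conn = {
		.seq_to_tap = 1000, .seq_ack_from_tap = 1000,
		.seq_wnd_edge = 1000,
	};

	sketch_window_update(&conn, 5000);	/* peer opens a 5000-byte window */
	assert(conn.seq_wnd_edge == 6000);

	conn.seq_to_tap = 4000;			/* 3000 bytes sent, not yet ACKed */
	sketch_window_update(&conn, 0);		/* peer now advertises zero */
	assert(conn.seq_wnd_edge == 6000);	/* remembered edge is unchanged */

	conn.seq_to_tap = conn.seq_ack_from_tap; /* timer: rewind to last ACK */
	return sketch_max_send(&conn);
}
```

In this sketch a timer-driven retransmit after the zero-window advertisement
may still send up to the remembered edge, whereas a check based only on the
currently advertised window would send nothing.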