From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTP id 5568A5A030B for ; Wed, 1 May 2024 22:28:44 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1714595323; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=2qklK3PqaOmFftwueHBHaJ6pkTM1Dm5bVaD5PsJ9Oc8=; b=O09r3c/KB/00T8AKxHV4n1Djp3OViKjitnIt82aLfQ7CRxe8TyzA1eEjVGYwxV9eSbg3Qm GBTqqr0BPOw6W7+U9Ae5ef2VSPcFCOntU+/cl84hfrQwJB1yAKv4ZVf/ZMD4dEJfomrYj0 nDQeHtb3I8B2yc2ay67xj5Jw4uqntu0= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-470-HIqj8yw6NJmgKq7CF0yOlQ-1; Wed, 01 May 2024 16:28:41 -0400 X-MC-Unique: HIqj8yw6NJmgKq7CF0yOlQ-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.rdu2.redhat.com [10.11.54.3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id A2407812C4F for ; Wed, 1 May 2024 20:28:41 +0000 (UTC) Received: from fenrir.redhat.com (unknown [10.22.34.98]) by smtp.corp.redhat.com (Postfix) with ESMTP id 478441121312; Wed, 1 May 2024 20:28:41 +0000 (UTC) From: Jon Maloy To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com, jmaloy@redhat.com Subject: [PATCH v2 2/2] tcp: allow retransmit when peer receive window is zero Date: Wed, 1 May 2024 16:28:39 -0400 Message-ID: <20240501202839.3491885-3-jmaloy@redhat.com> In-Reply-To: <20240501202839.3491885-1-jmaloy@redhat.com> References: <20240501202839.3491885-1-jmaloy@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.3 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="US-ASCII"; x-default=true Message-ID-Hash: IEQAN7PR2C5H2OOHEJH6UNEG4ACKULFH X-Message-ID-Hash: IEQAN7PR2C5H2OOHEJH6UNEG4ACKULFH X-MailFrom: jmaloy@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: A bug in kernel TCP may lead to a deadlock where a zero window is sent from the peer, while he is unable to send out window updates even after reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisemnts from the peer can only be triggered by packets arriving from this side. However, such packets are never sent, because the zero-window condition currently prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered *only* after the peer has dropped an arriving packet because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that RFC 9293 is violated, since the zero-advertisement shrinks the previously advertised window within which the dropped packet originally was sent. RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find the following statement: "A TCP receiver SHOULD NOT shrink the window, i.e., move the right window edge to the left (SHLD-14). However, a sending TCP peer MUST be robust against window shrinking, which may cause the "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34). If this happens, the sender SHOULD NOT send new data (SHLD-15), but SHOULD retransmit normally the old unacknowledged data between SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data beyond SND.UNA+SND.WND (MAY-7)" It should be noted that although this solves the problem we have at hand, it is not a genuine solution to the kernel bug. There may well be TCP stacks around in other OS-es which don't do this, nor have keep-alive probing as an alternatve way to solve the situation. Signed-off-by: Jon Maloy --- v2: - Using previously advertised window during retransmission, instead highest send sequencece number in the cycle. --- tcp.c | 29 ++++++++++++++++++++--------- tcp_conn.h | 2 ++ 2 files changed, 22 insertions(+), 9 deletions(-) diff --git a/tcp.c b/tcp.c index 535f1a2..703c62f 100644 --- a/tcp.c +++ b/tcp.c @@ -1749,9 +1749,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, */ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) { + uint32_t wnd_upper; + wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX); + wnd_upper = conn->seq_ack_from_tap + wnd; + if (wnd && SEQ_GT(wnd_upper, conn->seq_wup_from_tap)) + conn->seq_wup_from_tap = wnd_upper; + /* FIXME: reflect the tap-side receiver's window back to the sock-side * sender by adjusting SO_RCVBUF? */ } @@ -1784,6 +1790,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; + conn->seq_wup_from_tap = conn->seq_to_tap; } /** @@ -2133,9 +2140,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, * * #syscalls recvmsg */ -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) +static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn, uint32_t wnd_scaled) { - uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; int sendlen, len, plen, v4 = CONN_V4(conn); int s = conn->sock, i, ret = 0; @@ -2282,6 +2288,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, uint32_t max_ack_seq = conn->seq_ack_from_tap; uint32_t seq_from_tap = conn->seq_from_tap; struct msghdr mh = { .msg_iov = tcp_iov }; + uint32_t wnd; size_t len; ssize_t n; @@ -2325,8 +2332,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, SEQ_GE(ack_seq, max_ack_seq)) { /* Fast re-transmit */ retr = !len && !th->fin && - ack_seq == max_ack_seq && - ntohs(th->window) == max_ack_seq_wnd; + ack_seq == max_ack_seq; max_ack_seq_wnd = ntohs(th->window); max_ack_seq = ack_seq; @@ -2400,7 +2406,8 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, conn->seq_ack_from_tap = max_ack_seq; conn->seq_to_tap = max_ack_seq; set_peek_offset(conn, 0); - tcp_data_from_sock(c, conn); + wnd = conn->seq_wup_from_tap - max_ack_seq; + tcp_data_from_sock(c, conn, wnd); /* Empty queue before any POLL event tries to send it again */ tcp_l2_data_buf_flush(c); @@ -2500,7 +2507,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, /* The client might have sent data already, which we didn't * dequeue waiting for SYN,ACK from tap -- check now. */ - tcp_data_from_sock(c, conn); + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap); tcp_send_flag(c, conn, ACK); } @@ -2593,7 +2600,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af, tcp_tap_window_update(conn, ntohs(th->window)); - tcp_data_from_sock(c, conn); + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap); if (p->count - idx == 1) return 1; @@ -2776,6 +2783,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) { struct itimerspec check_armed = { { 0 }, { 0 } }; struct tcp_tap_conn *conn = CONN(ref.flow); + uint32_t wnd; if (c->no_tcp) return; @@ -2807,7 +2815,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) conn->seq_to_tap = conn->seq_ack_from_tap; set_peek_offset(conn, 0); tcp_l2_data_buf_flush(c); - tcp_data_from_sock(c, conn); + + /* RFC 9293, 3.8.6: Send 1 byte of new data if needed */ + wnd = conn->seq_wup_from_tap - conn->seq_ack_from_tap; + tcp_data_from_sock(c, conn, MAX(1, wnd)); tcp_l2_data_buf_flush(c); tcp_timer_ctl(c, conn); } @@ -2863,7 +2874,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) conn_event(c, conn, SOCK_FIN_RCVD); if (events & EPOLLIN) - tcp_data_from_sock(c, conn); + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap); if (events & EPOLLOUT) tcp_update_seqack_wnd(c, conn, 0, NULL); diff --git a/tcp_conn.h b/tcp_conn.h index a5f5cfe..d6b627a 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -30,6 +30,7 @@ * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap * @seq_ack_from_tap: Last ACK number received from tap + * @seq_wup_from_tap: Right edge (+1) of last non-zero window from tap * @seq_from_tap: Next sequence for packets from tap (not actually sent) * @seq_ack_to_tap: Last ACK number sent to tap * @seq_init_from_tap: Initial sequence number from tap @@ -101,6 +102,7 @@ struct tcp_tap_conn { uint32_t seq_to_tap; uint32_t seq_ack_from_tap; + uint32_t seq_wup_from_tap; uint32_t seq_from_tap; uint32_t seq_ack_to_tap; uint32_t seq_init_from_tap; -- 2.42.0