From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 9EA765A02EF for ; Wed, 15 May 2024 22:25:08 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715804707; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ImHNqWfV4MYCopWf6WR5TO2uUXd/yni6uuAWne9lSS8=; b=XJyUjJIMdYB2LUGUH5zvFQbUGQyfWfIPC7Hc4NBkogjbHsHJ1D9VqALXqCTXRIBNANsZXi pglnbRh+i+DlQAmwTcDIlFxuKSS20LAnJvUanNHWM879o0M9fSNLSbAo6n0ycDk3i4fyVB QG/0s1b/2faaWHtDu/2WR33nbcUCZyY= Received: from mail-ej1-f72.google.com (mail-ej1-f72.google.com [209.85.218.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-397-ugD0BEfjPrqo_73FKGJZ3g-1; Wed, 15 May 2024 16:24:55 -0400 X-MC-Unique: ugD0BEfjPrqo_73FKGJZ3g-1 Received: by mail-ej1-f72.google.com with SMTP id a640c23a62f3a-a5a05c1efa7so248005166b.1 for ; Wed, 15 May 2024 13:24:55 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1715804692; x=1716409492; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=ImHNqWfV4MYCopWf6WR5TO2uUXd/yni6uuAWne9lSS8=; b=Q1X5cFpe+7pJITbvHK8w5smhJ7ck7w8ANzUwNp6kxD9xls9FT0GK5+SjnznripTGjZ eMBQVxTok++6N0YatEp+z7pGpSevWW8bKDJbMU8Q2UB1YNtga9aihpyz9P6uSto+YzkD +tNMMsTWIXDFtFikbPsuR3SJgNG0UdNjdgOOYUXNOENc5cF0/s2QNmfCp4scZqLtv9mT LIVwcxr7u/ZWYpl4j+1fqH89w5KjqsmT+5BckrEBvl3vG95MSho384g1RbjkI3Elnf09 D9aEkR69I3Dxw03jOwQE8+FjetcUVGiYtWSYd4e7qKQENgsp8AixvpjbAY81/tgumtM2 9lxw== X-Gm-Message-State: AOJu0Yztymub82GGfXxdYOnuRedtpLqmyGURW/Jzag30ChhmWw9OLcS6 o/IS0s/lvW6Nc8QyJYOA5nf2yUyxfflV0GmklgZJOezV9erdfyWZSSeEBuEGgPcWIjSRxzk73bS 1zAT16sf74mfD4AVI+ztVixscSXMD60IPFefuV/dMXhmMSzTVb0xPpjgocehCwvSGBv0Jm638Zt DC7eFaJgCSicOlpCdezw+71axTPjeubGx9KFQ= X-Received: by 2002:a50:935b:0:b0:574:ecec:1887 with SMTP id 4fb4d7f45d1cf-574ecec19camr4083203a12.32.1715804692155; Wed, 15 May 2024 13:24:52 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGRIVQhbI9J+RhKG+5jQUkcDe34JcnNpSgK88rgdWy5BoI5GaB/xOmMRTfVi1tSlK9ra0q6cQ== X-Received: by 2002:a50:935b:0:b0:574:ecec:1887 with SMTP id 4fb4d7f45d1cf-574ecec19camr4083178a12.32.1715804691564; Wed, 15 May 2024 13:24:51 -0700 (PDT) Received: from maya.cloud.tilaa.com (maya.cloud.tilaa.com. [164.138.29.33]) by smtp.gmail.com with ESMTPSA id 4fb4d7f45d1cf-5733c2cb331sm9573847a12.67.2024.05.15.13.24.50 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Wed, 15 May 2024 13:24:50 -0700 (PDT) Date: Wed, 15 May 2024 22:24:17 +0200 From: Stefano Brivio To: Jon Maloy Subject: Re: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero Message-ID: <20240515222417.72ce256b@elisabeth> In-Reply-To: <20240515153429.859185-4-jmaloy@redhat.com> References: <20240515153429.859185-1-jmaloy@redhat.com> <20240515153429.859185-4-jmaloy@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.36; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: 4TVTVB2OLJESKOOYTUMLLWNHWBFY2CBF X-Message-ID-Hash: 4TVTVB2OLJESKOOYTUMLLWNHWBFY2CBF X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Wed, 15 May 2024 11:34:29 -0400 Jon Maloy wrote: > A bug in kernel TCP may lead to a deadlock where a zero window is sent > from the peer, while it is unable to send out window updates even after > reads have freed up enough buffer space to permit a larger window. > In this situation, new window advertisemnts from the peer can only be > triggered by packets arriving from this side. > > However, such packets are never sent, because the zero-window condition > currently prevents this side from sending out any packets whatsoever > to the peer. > > We notice that the above bug is triggered *only* after the peer has > dropped an arriving packet because of severe memory squeeze, and that we > hence always enter a retransmission situation when this occurs. This > also means that it goes against the RFC 9293 recommendation that a > previously advertised window never should shrink. > > RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find > the following statement: > "A TCP receiver SHOULD NOT shrink the window, i.e., move the right > window edge to the left (SHLD-14). However, a sending TCP peer MUST > be robust against window shrinking, which may cause the > "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34). > > If this happens, the sender SHOULD NOT send new data (SHLD-15), but > SHOULD retransmit normally the old unacknowledged data between SND.UNA > and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data > beyond SND.UNA+SND.WND (MAY-7)" > > We never see the window become negative, but we interpret this as a > recommendation to use the previously available window during > retransmission even when the currently advertised window is zero. > > We use the above mechanism only at timer-induced retransmits. > In the case we receive duplicate ack and a zero window, but > still know we have outstanding data acks waiting, we send out an > empty "fast probe" instead of doing fast retransmit. This averts > the risk of overwhelming a memory squeezed peer with retransmits, > while still forcing it to send out a new window update when the > probe is received. This entails a theoretical risk of redundant > retransmits from the peer, but that is a risk worth taking. > > In case of a zero-window non-retransmission situation where there > is no new data to be sent, we also add a simple zero-window probing > feature. By sending an empty packet at regular timeout events we > resolve the situation described above, since the peer receives the > necessary trigger to advertise its window once it becomes non-zero > again. > > It should be noted that although this solves the problem we have at > hand, it is not a genuine solution to the kernel bug. There may well > be TCP stacks around in other OS-es which don't do this, nor have > keep-alive probing as an alternatve way to solve the situation. > > Signed-off-by: Jon Maloy > > --- > v2: - Using previously advertised window during retransmission, instead > highest send sequencece number in the cycle. > v3: - Rebased to newest code > - Changes based on feedback from PASST team > - Sending out empty probe message at timer expiration when > we are not in retransmit situation. > v4: - Some small changes based on feedback from PASST team. > - Replaced fast retransmit with a one-time 'fast probe' when > window is zero. > --- > tcp.c | 32 +++++++++++++++++++++++++++----- > tcp_conn.h | 2 ++ > 2 files changed, 29 insertions(+), 5 deletions(-) > > diff --git a/tcp.c b/tcp.c > index 4163bf9..a33f494 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, > */ > static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) > { > + uint32_t wnd_edge; > + > wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); > conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX); > > + wnd_edge = conn->seq_ack_from_tap + wnd; > + if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap)) Here, cppcheck ('make cppcheck') says: tcp.c:1770:6: style: Condition 'wnd' is always true [knownConditionTrueFalse] if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap)) ^ tcp.c:1766:8: note: Assignment 'wnd=((1<<(16+8))<(wnd<ws_from_tap))?(1<<(16+8)):(wnd<ws_from_tap)', assigned value is less than 1 wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); ^ tcp.c:1770:6: note: Condition 'wnd' is always true if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap)) ^ See the comment in tcp_update_seqack_wnd() and related suppression. It's clearly a false positive (if you omit the MIN() macro, it goes away), so we need that same suppression here. > + conn->seq_wnd_edge_from_tap = wnd_edge; > + > /* FIXME: reflect the tap-side receiver's window back to the sock-side > * sender by adjusting SO_RCVBUF? */ > } > @@ -1796,6 +1802,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, > ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; > > conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; > + conn->seq_wnd_edge_from_tap = conn->seq_to_tap; > } > > /** > @@ -2205,13 +2212,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, > */ > static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > { > - uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; > int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; > int sendlen, len, dlen, v4 = CONN_V4(conn); > + uint32_t already_sent, max_send, seq; > int s = conn->sock, i, ret = 0; > struct msghdr mh_sock = { 0 }; > uint16_t mss = MSS_GET(conn); > - uint32_t already_sent, seq; > struct iovec *iov; > > /* How much have we read/sent since last received ack ? */ > @@ -2225,19 +2231,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > tcp_set_peek_offset(s, 0); > } > > - if (!wnd_scaled || already_sent >= wnd_scaled) { > + /* How much are we still allowed to send within current window ? */ > + max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap; > + if (SEQ_LE(max_send, 0)) { > + flow_trace(conn, "Empty window: win_upper: %u, sent: %u", This is not win_upper anymore, and the window is actually full rather than empty (it's... empty of space). Maybe: flow_trace(conn, "Window full: right edge: %u, sent: %u" > + conn->seq_wnd_edge_from_tap, conn->seq_to_tap); > + conn->seq_wnd_edge_from_tap = conn->seq_to_tap; > conn_flag(c, conn, STALLED); > conn_flag(c, conn, ACK_FROM_TAP_DUE); > return 0; > } > > /* Set up buffer descriptors we'll fill completely and partially. */ > - fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss); > + fill_bufs = DIV_ROUND_UP(max_send, mss); > if (fill_bufs > TCP_FRAMES) { > fill_bufs = TCP_FRAMES; > iov_rem = 0; > } else { > - iov_rem = (wnd_scaled - already_sent) % mss; > + iov_rem = max_send % mss; > } > > /* Prepare iov according to kernel capability */ > @@ -2466,6 +2477,13 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, > conn->seq_to_tap = max_ack_seq; > tcp_set_peek_offset(conn->sock, 0); > tcp_data_from_sock(c, conn); > + } else if (!max_ack_seq_wnd && SEQ_GT(conn->seq_to_tap, max_ack_seq)) { > + /* Force peer to send new advertisement now, but only once */ Two questions: - which advertisement? We're sending a zero-window probe, not forcing the peer to do much really. I would rather just state that we're sending a probe - what guarantees it only happens once? If we get more data from the socket, we'll get again SEQ_GT(conn->seq_to_tap, max_ack_seq) in a bit, and send another ACK (duplicate) to the peer, without the peer necessarily ever advertising a non-zero window meanwhile. I'm struggling a bit to understand how this can work "cleanly", a packet capture of this mechanism in action would certainly help. > + flow_trace(conn, "fast probe, ACK: %u, previous sequence: %u", > + max_ack_seq, conn->seq_to_tap); > + tcp_send_flag(c, conn, ACK); > + conn->seq_to_tap = max_ack_seq; > + tcp_set_peek_offset(conn->sock, 0); > } > > if (!iov_i) > @@ -2911,6 +2929,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) > flow_dbg(conn, "activity timeout"); > tcp_rst(c, conn); > } > + /* No data exchanged recently? Keep connection alive. */ ...I just spotted this from v3: this is not the reason why we're sending a keep-alive. We're sending a keep-alive segment because the peer advertised its window as zero. I also realised that this is not scheduled additionally, so it will just trigger on an activity timeout, I suppose. We should reschedule this after ACK_TIMEOUT, instead (that was my earlier suggestion, I didn't check anymore) when the peer advertises a zero window. > + if (conn->seq_to_tap == conn->seq_ack_from_tap && ...this part will only work if we reset seq_to_tap to seq_ack_from_tap earlier, and we have no pending data to send, which is not necessarily the case if we want to send a zero-window probe. > + conn->seq_from_tap == conn->seq_ack_to_tap) > + tcp_send_flag(c, conn, ACK); I think the conditions should simply be: - the window currently advertised by the peer is zero - we don't have pending data to acknowledge (otherwise the peer can interpret our keep-alive as a duplicate ACK) > } > } > > diff --git a/tcp_conn.h b/tcp_conn.h > index d280b22..5cbad2a 100644 > --- a/tcp_conn.h > +++ b/tcp_conn.h > @@ -30,6 +30,7 @@ > * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) > * @seq_to_tap: Next sequence for packets to tap > * @seq_ack_from_tap: Last ACK number received from tap > + * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap > * @seq_from_tap: Next sequence for packets from tap (not actually sent) > * @seq_ack_to_tap: Last ACK number sent to tap > * @seq_init_from_tap: Initial sequence number from tap > @@ -101,6 +102,7 @@ struct tcp_tap_conn { > > uint32_t seq_to_tap; > uint32_t seq_ack_from_tap; > + uint32_t seq_wnd_edge_from_tap; > uint32_t seq_from_tap; > uint32_t seq_ack_to_tap; > uint32_t seq_init_from_tap; -- Stefano