From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTP id 301D55A029A for ; Thu, 16 May 2024 13:23:23 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715858602; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=EqXCiOp3BVPLfE+hnrTx34fO8UIgEfcZzbqYpWxJeCg=; b=O/13Kuxowu3uyfwrb8qUsEKfGb1vnapocx5eOWw8jVaTR8XzNNxOpwZ8RNPrPanwnNr3z2 QOWJvA85VlXtaX+PKGbDDjtsGdgTpJ1ZencMCTbst2hMOiVBsVja9dTQWh0xlNa101ECso kLYRtnWtJPjqCY7ZmjDnudzxAX7BJ0M= Received: from mail-lf1-f72.google.com (mail-lf1-f72.google.com [209.85.167.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-306-ub-pTMtEMqqDoL47SxfYwA-1; Thu, 16 May 2024 07:23:20 -0400 X-MC-Unique: ub-pTMtEMqqDoL47SxfYwA-1 Received: by mail-lf1-f72.google.com with SMTP id 2adb3069b0e04-52025a83abcso7369905e87.3 for ; Thu, 16 May 2024 04:23:20 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1715858598; x=1716463398; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=2PpBd5cZkMfNtBsBhbW24yjnqTUHeQFsaX4UUfP/njE=; b=cvPOgz6pWTmzd7atgsLvJKztFXmsikply+2drJAIbz1rl5qBXXWo89lPUq+HHcE8yK 0BAtvualvhmCKFf76+mla4M9AvKwB/R5xk0o24pLsh86p/mnh5VHSgSUiPX5SQ0Ektco seNThkw3YL6dWnM3OcGZaHW9L3oKqmFMJ1HRfHLjGCVt2m6Yh+dtaCyNqoaT0PdALIOD LhjmVd0hE3s51ojZET3DB1S9NTM4vpnJqkJLQ9U3aM2k16+IGr2o57Z5nhHRF7zvVnXd TaXsOyfwN1U6tnugXeUg685hPgy1TCmE0EpLl5jEQXNQHvp64T7Vznr5MfrTY0UgfDSt yeNw== X-Gm-Message-State: AOJu0YzQujRpCL//PZMlAk/Bi4IzwAK4yFEChRMGYkD8CgW/l0KgCaeW cBG2ueIgCA+n6WGAvUUwRXal0XIwMPbkJvTfIGFKfJaZHSpVjo3KUojpH87kce9RMVgaZa+qSxu ElSgHV4awnm/wE2qzMqDfXdlOgAQxFBdbC7sSAc6d+AGlP0QO+ac/C+cEDhytXo0DsgVnjGscz4 +8tbpuVgzHjC31OcDeQSUNlM5i/KTKDKwwlPM= X-Received: by 2002:ac2:4d06:0:b0:51d:1239:21e8 with SMTP id 2adb3069b0e04-5221017e0a8mr12003134e87.37.1715858597973; Thu, 16 May 2024 04:23:17 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFRXJLMArr6yRHLM2OuvY9mJrmNOvU9cjG83vUHskxYNUpBjWT457iLQKS1t2boRtXS0crpww== X-Received: by 2002:ac2:4d06:0:b0:51d:1239:21e8 with SMTP id 2adb3069b0e04-5221017e0a8mr12003096e87.37.1715858596972; Thu, 16 May 2024 04:23:16 -0700 (PDT) Received: from maya.cloud.tilaa.com (maya.cloud.tilaa.com. [164.138.29.33]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-a5a1781cd90sm963681666b.20.2024.05.16.04.23.15 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 16 May 2024 04:23:16 -0700 (PDT) Date: Thu, 16 May 2024 13:22:42 +0200 From: Stefano Brivio To: Jon Maloy Subject: Re: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero Message-ID: <20240516132242.20bc761b@elisabeth> In-Reply-To: References: <20240515153429.859185-1-jmaloy@redhat.com> <20240515153429.859185-4-jmaloy@redhat.com> <20240515222417.72ce256b@elisabeth> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.36; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Message-ID-Hash: RXECOIUSHQVEHZFAH7Z3ZDTCUQISHIWV X-Message-ID-Hash: RXECOIUSHQVEHZFAH7Z3ZDTCUQISHIWV X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Wed, 15 May 2024 19:10:49 -0400 Jon Maloy wrote: > On 2024-05-15 16:24, Stefano Brivio wrote: > > On Wed, 15 May 2024 11:34:29 -0400 > > Jon Maloy wrote: > > =20 > >> A bug in kernel TCP may lead to a deadlock where a zero window is sent > >> from the peer, while it is unable to send out window updates even afte= r > >> reads have freed up enough buffer space to permit a larger window. > >> In this situation, new window advertisemnts from the peer can only be > >> triggered by packets arriving from this side. > >> > >> However, such packets are never sent, because the zero-window conditio= n > >> currently prevents this side from sending out any packets whatsoever > >> to the peer. > >> > >> We notice that the above bug is triggered *only* after the peer has > >> dropped an arriving packet because of severe memory squeeze, and that = we > >> hence always enter a retransmission situation when this occurs. This > >> also means that it goes against the RFC 9293 recommendation that a > >> previously advertised window never should shrink. > >> > >> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we fin= d > >> the following statement: > >> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right > >> window edge to the left (SHLD-14). However, a sending TCP peer MUST > >> be robust against window shrinking, which may cause the > >> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34). > >> > >> If this happens, the sender SHOULD NOT send new data (SHLD-15), but > >> SHOULD retransmit normally the old unacknowledged data between SND.UNA > >> and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data > >> beyond SND.UNA+SND.WND (MAY-7)" > >> > >> We never see the window become negative, but we interpret this as a > >> recommendation to use the previously available window during > >> retransmission even when the currently advertised window is zero. > >> > >> We use the above mechanism only at timer-induced retransmits. > >> In the case we receive duplicate ack and a zero window, but > >> still know we have outstanding data acks waiting, we send out an > >> empty "fast probe" instead of doing fast retransmit. This averts > >> the risk of overwhelming a memory squeezed peer with retransmits, > >> while still forcing it to send out a new window update when the > >> probe is received. This entails a theoretical risk of redundant > >> retransmits from the peer, but that is a risk worth taking. > >> > >> In case of a zero-window non-retransmission situation where there > >> is no new data to be sent, we also add a simple zero-window probing > >> feature. By sending an empty packet at regular timeout events we > >> resolve the situation described above, since the peer receives the > >> necessary trigger to advertise its window once it becomes non-zero > >> again. > >> > >> It should be noted that although this solves the problem we have at > >> hand, it is not a genuine solution to the kernel bug. There may well > >> be TCP stacks around in other OS-es which don't do this, nor have > >> keep-alive probing as an alternatve way to solve the situation. > >> > >> Signed-off-by: Jon Maloy > >> > >> --- > >> v2: - Using previously advertised window during retransmission, instea= d > >> highest send sequencece number in the cycle. > >> v3: - Rebased to newest code > >> - Changes based on feedback from PASST team > >> - Sending out empty probe message at timer expiration when > >> we are not in retransmit situation. > >> v4: - Some small changes based on feedback from PASST team. > >> - Replaced fast retransmit with a one-time 'fast probe' when > >> window is zero. > >> --- > >> tcp.c | 32 +++++++++++++++++++++++++++----- > >> tcp_conn.h | 2 ++ > >> 2 files changed, 29 insertions(+), 5 deletions(-) > >> > >> diff --git a/tcp.c b/tcp.c > >> index 4163bf9..a33f494 100644 > >> --- a/tcp.c > >> +++ b/tcp.c > >> @@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn = *conn, > >> */ > >> static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigne= d wnd) > >> { > >> +=09uint32_t wnd_edge; > >> + > >> =09wnd =3D MIN(MAX_WINDOW, wnd << conn->ws_from_tap); > >> =09conn->wnd_from_tap =3D MIN(wnd >> conn->ws_from_tap, USHRT_MAX); > >> =20 > >> +=09wnd_edge =3D conn->seq_ack_from_tap + wnd; > >> +=09if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap)) =20 > > Here, cppcheck ('make cppcheck') says: > > > > tcp.c:1770:6: style: Condition 'wnd' is always true [knownConditionTrue= False] > > if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap)) > > ^ > > tcp.c:1766:8: note: Assignment 'wnd=3D((1<<(16+8))<(wnd<ws_from_= tap))?(1<<(16+8)):(wnd<ws_from_tap)', assigned value is less than 1 > > wnd =3D MIN(MAX_WINDOW, wnd << conn->ws_from_tap); > > ^ > > tcp.c:1770:6: note: Condition 'wnd' is always true > > if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap)) > > ^ > > > > See the comment in tcp_update_seqack_wnd() and related suppression. > > > > It's clearly a false positive (if you omit the MIN() macro, it goes > > away), so we need that same suppression here. =20 > Ok. I'll change it. Still a little annoying when our tools are causing=20 > us extra job because they aren't up to the task. Just like in David's experience, cppcheck probably saved me a substantial number of embarrassing situations. I would say it's very much up to the task to be honest, a few false positives are something we can definitely expect. > > =20 > >> +=09=09conn->seq_wnd_edge_from_tap =3D wnd_edge; > >> + > >> =09/* FIXME: reflect the tap-side receiver's window back to the sock= -side > >> =09 * sender by adjusting SO_RCVBUF? */ > >> } > >> @@ -1796,6 +1802,7 @@ static void tcp_seq_init(const struct ctx *c, st= ruct tcp_tap_conn *conn, > >> =09ns =3D (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; > >> =20 > >> =09conn->seq_to_tap =3D ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + = ns; > >> +=09conn->seq_wnd_edge_from_tap =3D conn->seq_to_tap; > >> } > >> =20 > >> /** > >> @@ -2205,13 +2212,12 @@ static void tcp_data_to_tap(const struct ctx *= c, struct tcp_tap_conn *conn, > >> */ > >> static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *co= nn) > >> { > >> -=09uint32_t wnd_scaled =3D conn->wnd_from_tap << conn->ws_from_tap; > >> =09int fill_bufs, send_bufs =3D 0, last_len, iov_rem =3D 0; > >> =09int sendlen, len, dlen, v4 =3D CONN_V4(conn); > >> +=09uint32_t already_sent, max_send, seq; > >> =09int s =3D conn->sock, i, ret =3D 0; > >> =09struct msghdr mh_sock =3D { 0 }; > >> =09uint16_t mss =3D MSS_GET(conn); > >> -=09uint32_t already_sent, seq; > >> =09struct iovec *iov; > >> =20 > >> =09/* How much have we read/sent since last received ack ? */ > >> @@ -2225,19 +2231,24 @@ static int tcp_data_from_sock(struct ctx *c, s= truct tcp_tap_conn *conn) > >> =09=09tcp_set_peek_offset(s, 0); > >> =09} > >> =20 > >> -=09if (!wnd_scaled || already_sent >=3D wnd_scaled) { > >> +=09/* How much are we still allowed to send within current window ? *= / > >> +=09max_send =3D conn->seq_wnd_edge_from_tap - conn->seq_to_tap; > >> +=09if (SEQ_LE(max_send, 0)) { > >> +=09=09flow_trace(conn, "Empty window: win_upper: %u, sent: %u", =20 > > This is not win_upper anymore, and the window is actually full rather > > than empty (it's... empty of space). Maybe: > > > > =09=09flow_trace(conn, "Window full: right edge: %u, sent: %u" =20 > yes. > > =20 > >> +=09=09=09 conn->seq_wnd_edge_from_tap, conn->seq_to_tap); > >> +=09=09conn->seq_wnd_edge_from_tap =3D conn->seq_to_tap; > >> =09=09conn_flag(c, conn, STALLED); > >> =09=09conn_flag(c, conn, ACK_FROM_TAP_DUE); > >> =09=09return 0; > >> =09} > >> =20 > >> =09/* Set up buffer descriptors we'll fill completely and partially.= */ > >> -=09fill_bufs =3D DIV_ROUND_UP(wnd_scaled - already_sent, mss); > >> +=09fill_bufs =3D DIV_ROUND_UP(max_send, mss); > >> =09if (fill_bufs > TCP_FRAMES) { > >> =09=09fill_bufs =3D TCP_FRAMES; > >> =09=09iov_rem =3D 0; > >> =09} else { > >> -=09=09iov_rem =3D (wnd_scaled - already_sent) % mss; > >> +=09=09iov_rem =3D max_send % mss; > >> =09} > >> =20 > >> =09/* Prepare iov according to kernel capability */ > >> @@ -2466,6 +2477,13 @@ static int tcp_data_from_tap(struct ctx *c, str= uct tcp_tap_conn *conn, > >> =09=09conn->seq_to_tap =3D max_ack_seq; > >> =09=09tcp_set_peek_offset(conn->sock, 0); > >> =09=09tcp_data_from_sock(c, conn); > >> +=09} else if (!max_ack_seq_wnd && SEQ_GT(conn->seq_to_tap, max_ack_se= q)) { > >> +=09=09/* Force peer to send new advertisement now, but only once */ = =20 > > Two questions: > > > > - which advertisement? We're sending a zero-window probe, not forcing > > the peer to do much really. I would rather just state that we're > > sending a probe =20 > Actually, it is only clear if you know the code of the (linux) peer. > I realized this was maybe a too strong statement, but this is really > what happens. Our peer is not necessarily the Linux kernel, though, and, especially, not a specific (broken) version of it. > > - what guarantees it only happens once? If we get more data from the > > socket, we'll get again SEQ_GT(conn->seq_to_tap, max_ack_seq) in a > > bit, and send another ACK (duplicate) to the peer, without the peer > > necessarily ever advertising a non-zero window meanwhile =20 > Right. I need to give this=C2=A0 a new spin, but I think I am on the righ= t=20 > track. > I *really* want this to be solved here, and not need to fall back to > =C2=A0the timeout except under exceptional conditions. > This happens really often in my tests, and with this fix it actually > works like a charm. >=20 > > > > I'm struggling a bit to understand how this can work "cleanly", a > > packet capture of this mechanism in action would certainly help. =20 > Ok. In v5. > > =20 > >> +=09=09flow_trace(conn, "fast probe, ACK: %u, previous sequence: %u", > >> +=09=09=09 max_ack_seq, conn->seq_to_tap); > >> +=09=09tcp_send_flag(c, conn, ACK); > >> +=09=09conn->seq_to_tap =3D max_ack_seq; > >> +=09=09tcp_set_peek_offset(conn->sock, 0); > >> =09} > >> =20 > >> =09if (!iov_i) > >> @@ -2911,6 +2929,10 @@ void tcp_timer_handler(struct ctx *c, union epo= ll_ref ref) > >> =09=09=09flow_dbg(conn, "activity timeout"); > >> =09=09=09tcp_rst(c, conn); > >> =09=09} > >> +=09=09/* No data exchanged recently? Keep connection alive. */ =20 > > ...I just spotted this from v3: this is not the reason why we're > > sending a keep-alive. We're sending a keep-alive segment because the > > peer advertised its window as zero. > > > > I also realised that this is not scheduled additionally, so it will > > just trigger on an activity timeout, I suppose. We should reschedule > > this after ACK_TIMEOUT, instead (that was my earlier suggestion, I > > didn't check anymore) when the peer advertises a zero window. > > =20 > >> +=09=09if (conn->seq_to_tap =3D=3D conn->seq_ack_from_tap && =20 > > ...this part will only work if we reset seq_to_tap to seq_ack_from_tap > > earlier, and we have no pending data to send, which is not necessarily > > the case if we want to send a zero-window probe. > > =20 > >> +=09=09 conn->seq_from_tap =3D=3D conn->seq_ack_to_tap) > >> +=09=09=09tcp_send_flag(c, conn, ACK); =20 > > I think the conditions should simply be: > > > > - the window currently advertised by the peer is zero > > > > - we don't have pending data to acknowledge (otherwise the peer can > > interpret our keep-alive as a duplicate ACK) =20 > That is ok. > The deadlock situation we have ended up in will anyway be resolved by the > real retransmit that will happen before we get here. > I am now wondering if this probe it serves any purpose at all? Well, RFC 9293 says we must implement it, so it's a good idea to do that, I guess. Let's say your first probe is lost for whatever reason (if the system is low on memory, that is actually likely to happen with tap devices). > > =20 > >> =09} > >> } > >> =20 > >> diff --git a/tcp_conn.h b/tcp_conn.h > >> index d280b22..5cbad2a 100644 > >> --- a/tcp_conn.h > >> +++ b/tcp_conn.h > >> @@ -30,6 +30,7 @@ > >> * @wnd_to_tap:=09=09Sending window advertised to tap, unscaled (as = sent) > >> * @seq_to_tap:=09=09Next sequence for packets to tap > >> * @seq_ack_from_tap:=09Last ACK number received from tap > >> + * @seq_wnd_edge_from_tap: Right edge of last non-zero window from ta= p > >> * @seq_from_tap:=09Next sequence for packets from tap (not actually= sent) > >> * @seq_ack_to_tap:=09Last ACK number sent to tap > >> * @seq_init_from_tap:=09Initial sequence number from tap > >> @@ -101,6 +102,7 @@ struct tcp_tap_conn { > >> =20 > >> =09uint32_t=09seq_to_tap; > >> =09uint32_t=09seq_ack_from_tap; > >> +=09uint32_t=09seq_wnd_edge_from_tap; > >> =09uint32_t=09seq_from_tap; > >> =09uint32_t=09seq_ack_to_tap; > >> =09uint32_t=09seq_init_from_tap; =20 >=20 > I'll re-spin the two first ones asap, so you can apply them, and then I= =20 > will try to improve > this one further. Just mind that if 2/3 makes the issue you're working around here more likely to happen (me, I've never seen it), we shouldn't go ahead with 2/3 without this, I guess. --=20 Stefano