From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 14 May 2024 23:09:43 +0200
From: Stefano Brivio
To: Jon Maloy
CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH v3 3/3] tcp: allow retransmit when peer receive window is zero
Message-ID: <20240514230943.3049d79a@elisabeth>
References: <20240511152008.421750-1-jmaloy@redhat.com> <20240511152008.421750-4-jmaloy@redhat.com> <20240514194650.53b433f9@elisabeth>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Tue, 14 May 2024 16:19:16 -0400
Jon Maloy wrote:

> On 2024-05-14 13:46, Stefano Brivio wrote:
> > On Sat, 11 May 2024 11:20:08 -0400
> > Jon Maloy wrote:
> >
> >> A bug in kernel TCP may lead to a deadlock where a zero window is sent
> >> from the peer, while it is unable to send out window updates even after
> >> reads have freed up enough buffer space to permit a larger window.
> >> In this situation, new window advertisements from the peer can only be
> >> triggered by packets arriving from this side.
> >>
> >> However, such packets are never sent, because the zero-window condition
> >> currently prevents this side from sending out any packets whatsoever
> >> to the peer.
> >>
> >> We notice that the above bug is triggered *only* after the peer has
> >> dropped an arriving packet because of severe memory squeeze, and that we
> >> hence always enter a retransmission situation when this occurs. This
> >> also means that it goes against the RFC 9293 recommendation that a
> >> previously advertised window never should shrink.
> >>
> >> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
> >> the following statement:
> >> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right
> >> window edge to the left (SHLD-14). However, a sending TCP peer MUST
> >> be robust against window shrinking, which may cause the
> >> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
> >>
> >> If this happens, the sender SHOULD NOT send new data (SHLD-15), but
> >> SHOULD retransmit normally the old unacknowledged data between SND.UNA
> >> and SND.UNA+SND.WND (SHLD-16).
> >> The sender MAY also retransmit old data
> >> beyond SND.UNA+SND.WND (MAY-7)"
> >>
> >> We never see the window become negative, but we interpret this as a
> >> recommendation to use the previously available window during
> >> retransmission even when the currently advertised window is zero.
> >>
> >> In case of a zero-window non-retransmission situation where there
> >> is no new data to be sent, we also add a simple zero-window probing
> >> feature. By sending an empty packet at regular timeout events we
> >> resolve the situation described above, since the peer receives the
> >> necessary trigger to advertise its window once it becomes non-zero
> >> again.
> >>
> >> It should be noted that although this solves the problem we have at
> >> hand, it is not a genuine solution to the kernel bug. There may well
> >> be TCP stacks around in other OS-es which don't do this, nor have
> >> keep-alive probing as an alternative way to solve the situation.
> >>
> >> Signed-off-by: Jon Maloy
> >>
> >> ---
> >> v2: - Using previously advertised window during retransmission, instead
> >>       highest send sequence number in the cycle.
> >> v3: - Rebased to newest code
> >>     - Changes based on feedback from PASST team
> >>     - Sending out empty probe message at timer expiration when
> >>       we are not in retransmit situation.
> >> ---
> >>  tcp.c      | 30 +++++++++++++++++++++---------
> >>  tcp_conn.h |  2 ++
> >>  2 files changed, 23 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/tcp.c b/tcp.c
> >> index 8297812..bd6bf35 100644
> >> --- a/tcp.c
> >> +++ b/tcp.c
> >> @@ -1774,9 +1774,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
> >>   */
> >>  static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> >>  {
> >> +	uint32_t wnd_upper;
> >> +
> >>  	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> >>  	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
> >>
> >> +	wnd_upper = conn->seq_ack_from_tap + wnd;
> >> +	if (wnd && SEQ_GT(wnd_upper, conn->seq_wup_from_tap))
> >> +		conn->seq_wup_from_tap = wnd_upper;
> >> +
> >>  	/* FIXME: reflect the tap-side receiver's window back to the sock-side
> >>  	 * sender by adjusting SO_RCVBUF? */
> >>  }
> >> @@ -1809,6 +1815,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> >>  	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
> >>
> >>  	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
> >> +	conn->seq_wup_from_tap = conn->seq_to_tap;
> >>  }
> >>
> >>  /**
> >> @@ -2220,7 +2227,6 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >>   */
> >>  static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >>  {
> >> -	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> >>  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> >>  	int sendlen, len, dlen, v4 = CONN_V4(conn);
> >>  	uint32_t max_send, seq, already_sent;
> >> @@ -2241,10 +2247,11 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >>  	}
> >>
> >>  	/* How much are we still allowed to send within current window ? */
> >> -	max_send = conn->seq_ack_from_tap + wnd_scaled - conn->seq_to_tap;
> >> +	max_send = conn->seq_wup_from_tap - conn->seq_to_tap;
> >>  	if (SEQ_LE(max_send, 0)) {
> >> -		flow_trace(conn, "Empty window: win: %u, sent: %u",
> >> -			   wnd_scaled, conn->seq_to_tap);
> >> +		flow_trace(conn, "Empty window: win_upper: %u, sent: %u",
> >> +			   conn->seq_wup_from_tap, conn->seq_to_tap);
> >> +		conn->seq_wup_from_tap = conn->seq_to_tap;
> >>  		conn_flag(c, conn, STALLED);
> >>  		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> >>  		return 0;
> >> @@ -2380,7 +2387,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >>  	ASSERT(conn->events & ESTABLISHED);
> >>
> >>  	for (i = idx, iov_i = 0; i < (int)p->count; i++) {
> >> -		uint32_t seq, seq_offset, ack_seq;
> >> +		uint32_t seq, seq_offset, ack_seq, wnd;
> >>  		const struct tcphdr *th;
> >>  		char *data;
> >>  		size_t off;
> >> @@ -2413,11 +2420,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >>  			if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) &&
> >>  			    SEQ_GE(ack_seq, max_ack_seq)) {
> >>  				/* Fast re-transmit */
> >> +				wnd = ntohs(th->window);
> >>  				retr = !len && !th->fin &&
> >>  				       ack_seq == max_ack_seq &&
> >> -				       ntohs(th->window) == max_ack_seq_wnd;
> >> +				       (wnd == max_ack_seq_wnd || !wnd);
> > Just as a reminder, as I mentioned on Monday: this means we'll
> > re-transmit whenever we get a pure window update (!len && !th->fin
> > && ack_seq == max_ack_seq) with a zero window. The receiver is telling
> > us it ran out of space, and wham, we flood them, as a punishment.
> >
> > I would let this check alone, and just add zero-window probing, plus
> > whatever retransmission you mentioned from the RFC -- but not a fast
> > re-transmit on a zero window.
> I think I have a good idea here. I'll use it in my next version.
>
> >
> >>
> >> -				max_ack_seq_wnd = ntohs(th->window);
> >> +				max_ack_seq_wnd = wnd;
> >>  				max_ack_seq = ack_seq;
> >>  			}
> >>  		}
> >> @@ -2480,8 +2488,9 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >>
> >>  	if (retr) {
> >>  		flow_trace(conn,
> >> -			   "fast re-transmit, ACK: %u, previous sequence: %u",
> >> -			   max_ack_seq, conn->seq_to_tap);
> >> +			   "fast re-transmit, seqno %u -> %u, win_upper: %u",
> >> +			   conn->seq_to_tap, max_ack_seq,
> > I'm not sure if "->" really conveys the meaning of "we're sending this
> > sequence *because* of that acknowledgement number".
> It really means "we are rewinding seq_to_tap from X to Y". That it is
> caused by a duplicate ack is implicit.

I wouldn't take that for granted, so much that with the current version
of this patch, it's *not* necessarily caused by a duplicate
acknowledgement.

Anyway, it really doesn't look intuitive to me, and users have to
figure out what's happening, too.

> > I would rather keep
> > the received acknowledged sequence before everything else, because
> > that's the causal trigger for the retransmission.
> >
> >> +			   conn->seq_wup_from_tap);
> >>
> >>  		conn->seq_to_tap = max_ack_seq;
> >>  		tcp_set_peek_offset(conn->sock, 0);
> >> @@ -2931,6 +2940,9 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >>  			flow_dbg(conn, "activity timeout");
> >>  			tcp_rst(c, conn);
> >>  		}
> >> +		/* No data sent recently? Keep connection alive. */
> >> +		if (conn->seq_to_tap == conn->seq_ack_from_tap)
> >> +			tcp_send_flag(c, conn, ACK_IF_NEEDED);
> > If the window is zero, this won't send anything, see the first
> > condition in tcp_send_flag().
> > ACK_IF_NEEDED implies that that function
> > should queue an ACK segment if we have data to acknowledge.
> Ok. I missed that.
> > Here, the flag you want is simply 'ACK'. But we should make sure that
> > this can't be taken as a duplicate ACK, that is, we should only send
> > this if seq_ack_to_tap == seq_from_tap.
> >
> > Otherwise, we shouldn't send anything, lest the peer retransmit
> > anything that we didn't acknowledge yet.
> But then we have no probing... Wasn't that the whole point of this?

We generally do. We would have no probing only in case an ACK for data
that the peer *sent* us is due by us (the other way around). There,
probing would mean causing the peer to re-transmit, which we don't want
to trigger here.

> >
> >>  	}
> >>  }
> >>
> >> diff --git a/tcp_conn.h b/tcp_conn.h
> >> index d280b22..8ae20ef 100644
> >> --- a/tcp_conn.h
> >> +++ b/tcp_conn.h
> >> @@ -30,6 +30,7 @@
> >>   * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
> >>   * @seq_to_tap:		Next sequence for packets to tap
> >>   * @seq_ack_from_tap:	Last ACK number received from tap
> >> + * @seq_wup_from_tap:	Right edge of last non-zero window from tap
> > "Right edge" makes much more sense to me, and it also matches RFC
> > language. Could we turn all the "wup" and "upper" references into
> > something like "edge" or "right_edge"?
> I tried to come up with something short, because the field name becomes
> impractically long. I am open to suggestions.

@wnd_edge_from_tap?

-- 
Stefano
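
For illustration only, the two ideas discussed in this thread (remembering
the right edge of the last non-zero window, and sending a probe only when no
ACK is due from our side) could be sketched in standalone C roughly as below.
This is not passt code: struct probe_state, its field names, and the helpers
seq_gt(), window_update(), usable_window() and may_send_probe() are all
hypothetical names made up for this sketch, and the wrap-safe comparison is
one common way to compare 32-bit sequence numbers, not necessarily how the
SEQ_*() macros are implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-connection state (not the passt struct) */
struct probe_state {
	uint32_t seq_ack_from_peer;	/* Last ACK number received from peer */
	uint32_t wnd_edge_from_peer;	/* Right edge of last non-zero window */
	uint32_t seq_from_peer;		/* Next sequence expected from peer */
	uint32_t seq_ack_to_peer;	/* Last ACK number we sent to peer */
};

/* Wrap-safe "a is after b" in 32-bit sequence space */
static bool seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Record an advertised window: only a non-zero window that moves the
 * right edge forward updates it, so a zero window leaves the old edge.
 */
static void window_update(struct probe_state *s, uint32_t wnd)
{
	uint32_t edge = s->seq_ack_from_peer + wnd;

	if (wnd && seq_gt(edge, s->wnd_edge_from_peer))
		s->wnd_edge_from_peer = edge;
}

/* Bytes that may still be sent starting at seq, 0 at or past the edge */
static uint32_t usable_window(const struct probe_state *s, uint32_t seq)
{
	if (!seq_gt(s->wnd_edge_from_peer, seq))
		return 0;

	return s->wnd_edge_from_peer - seq;
}

/* An empty ACK is only a safe probe if we owe the peer no ACK for its
 * data: otherwise it could be taken as a duplicate ACK.
 */
static bool may_send_probe(const struct probe_state *s)
{
	return s->seq_ack_to_peer == s->seq_from_peer;
}
```

With this sketch, a zero-window advertisement leaves wnd_edge_from_peer
untouched, so data up to the previously advertised edge can still be
retransmitted, and the timer path would call may_send_probe() before sending
an empty segment.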