Date: Wed, 24 Apr 2024 20:31:36 +0200
From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy
CC: David Gibson, passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH 2/2] tcp: allow retransmit when peer receive window is zero
Message-ID: <20240424203136.3c142249@elisabeth>
References: <20240420191920.104876-1-jmaloy@redhat.com>
	<20240420191920.104876-3-jmaloy@redhat.com>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On top of David's comments:

On Wed, 24 Apr 2024 11:04:51 +1000
David Gibson wrote:

> On Sat, Apr 20, 2024 at 03:19:20PM -0400, Jon Maloy wrote:
> > A bug in kernel TCP may lead to a deadlock where a zero window is sent
> > from the peer, while buffer reads doesn't lead to it being updated.
> > At the same time, the zero window stops this side from sending out
> > more data to trigger new advertisements to be sent from the peer.
> 
> This is a bit confusing to me. I guess the buffer reads must be
> happening on the peer, not "this side", but that's not very clear. I
> think "received from the peer" might be better than "sent from the
> peer" to make it more obvious that the point of view is uniformly from
> "this side".
> 
> > RFC 793 states that it always is permitted for a sender to send one byte
> > of data even when the window is zero.

As incredible as it might sound, RFC 793 is now obsolete :) -- you
should refer to RFC 9293 instead.

Further, RFC 9293 3.8.6.1 (just like RFC 793) says that we *must*
(albeit not MUST) "regularly" send that octet -- as long as we have one
available. Not that it's permitted.

Hence, a general comment on this patch: my understanding is that we
want to implement zero-window probing, instead of triggering a fast
re-transmit whenever we get a keep-alive packet from the peer, because
we don't know if we'll get one.

RFC 9293 3.8.6.1 goes on saying:

  The transmitting host SHOULD send the first zero-window probe when a
  zero window has existed for the retransmission timeout period
  (SHLD-29) (Section 3.8.1), and SHOULD increase exponentially the
  interval between successive probes (SHLD-30).
but we currently don't compute the retransmission timeout according to
RFC 6298 (yeah, I know...). Given that it's a SHOULD, and complying
with that is clearly beyond the scope of this change, we can just use
one of the existing timeouts and timers (ACK_TIMEOUT would do, for the
moment).

> > This resolves the deadlock described
> > above, so we choose to introduce it here as a last resort. We allow it both
> > during fast and as keep-alives when the timer sees no activity on
> > the connection.
> 
> Looks like there's a missing noun after "during fast".
> 
> > However, we notice that this solution doesn´t work well. Traffic sometimes
> > goes to zero, and onley recovers after the timer has resolved the situation.
> 
> s/onley/only/
> 
> > Because of this, we chose to improve it slightly: The deadlock happens when a
> 
> Is it actually useful to describe the "bad" workaround if we're using
> a different one? I don't see how the better workaround is an obvious
> extension of the bad one.
> 
> > packet has been dropped at the peer end because of memory squeeze. We therefore
> > consider it legitimate to retransmit that packet while considering the window
> > size that was valid at the moment it was first transmitted. This works
> > much better.
> > 
> > It should be noted that although this solves the problem we have at hand,
> > it is not a genuine solution to the kernel bug. There may well be TCP stacks
> > around in other OS-es which don't do this probing.
> > 
> > Signed-off-by: Jon Maloy
> > ---
> >  tcp.c      | 26 ++++++++++++++++----------
> >  tcp_conn.h |  2 ++
> >  2 files changed, 18 insertions(+), 10 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 95d400a..9dea151 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1774,6 +1774,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
> >  
> >  	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
> > +	conn->max_seq_to_tap = conn->seq_to_tap;
> >  }
> >  
> >  /**
> > @@ -2123,9 +2124,8 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >   *
> >   * #syscalls recvmsg
> >   */
> > -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> > +static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn, uint32_t wnd_scaled)

This exceeds 80 columns for no particularly good reason, and the
function comment should be updated.

> >  {
> > -	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> >  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
> >  	int sendlen, len, plen, v4 = CONN_V4(conn);
> >  	int s = conn->sock, i, ret = 0;
> > @@ -2212,6 +2212,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  
> >  		return 0;
> >  	}
> > +	sendlen = len;
> > +	if (!peek_offset_cap)
> > +		sendlen -= already_sent;
> >  
> >  	sendlen = len;
> >  	if (!peek_offset_cap)
> 
> Looks like some duplicated lines from a bad rebase.
> 
> > @@ -2241,7 +2244,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  		tcp_data_to_tap(c, conn, plen, no_csum, seq);
> >  		seq += plen;
> >  	}
> > -
> > +	/* We need this to know this during retransmission: */

s/this to know this/to know this/ (I guess)

> > +	if (SEQ_GT(seq, conn->max_seq_to_tap))
> > +		conn->max_seq_to_tap = seq;
> >  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> >  
> >  	return 0;
> > @@ -2317,8 +2322,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >  			   SEQ_GE(ack_seq, max_ack_seq)) {
> >  				/* Fast re-transmit */
> >  				retr = !len && !th->fin &&
> > -				       ack_seq == max_ack_seq &&
> > -				       ntohs(th->window) == max_ack_seq_wnd;
> > +				       ack_seq == max_ack_seq;

This matches any keep-alive (that is, a segment without data and
without a window update) we get: I think you should replace the
existing condition with a check that the window is zero, instead.

> >  
> >  				max_ack_seq_wnd = ntohs(th->window);
> >  				max_ack_seq = ack_seq;
> > @@ -2385,9 +2389,10 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >  		flow_trace(conn,
> >  			   "fast re-transmit, ACK: %u, previous sequence: %u",
> >  			   max_ack_seq, conn->seq_to_tap);
> > +

Spurious change.

> >  		conn->seq_ack_from_tap = max_ack_seq;
> >  		conn->seq_to_tap = max_ack_seq;
> > -		tcp_data_from_sock(c, conn);
> > +		tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ack_from_tap));
> 
> Does the MAX do anything: if max_seq_to_tap equals seq_ack_from_tap
> then, by definition, there is nothing to retransmit - everything has
> been acked.
> 
> >  	}
> >  
> >  	if (!iov_i)
> > @@ -2483,7 +2488,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> >  	/* The client might have sent data already, which we didn't
> >  	 * dequeue waiting for SYN,ACK from tap -- check now.
> >  	 */
> > -	tcp_data_from_sock(c, conn);
> > +	tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
> >  	tcp_send_flag(c, conn, ACK);
> >  }
> >  
> > @@ -2575,7 +2580,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
> >  
> >  		tcp_tap_window_update(conn, ntohs(th->window));
> >  
> > -		tcp_data_from_sock(c, conn);
> > +		tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
> >  
> >  		if (p->count - idx == 1)
> >  			return 1;
> > @@ -2788,7 +2793,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >  			flow_dbg(conn, "ACK timeout, retry");
> >  			conn->retrans++;
> >  			conn->seq_to_tap = conn->seq_ack_from_tap;
> > -			tcp_data_from_sock(c, conn);
> > +			tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ack_from_tap));
> >  			tcp_timer_ctl(c, conn);
> >  		}
> >  	} else {
> > @@ -2807,6 +2812,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >  			tcp_rst(c, conn);
> >  		}
> >  	}
> > +
> 
> Spurious change
> 
> >  }
> >  
> >  /**
> > @@ -2843,7 +2849,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
> >  			conn_event(c, conn, SOCK_FIN_RCVD);
> >  
> >  		if (events & EPOLLIN)
> > -			tcp_data_from_sock(c, conn);
> > +			tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap);
> >  
> >  		if (events & EPOLLOUT)
> >  			tcp_update_seqack_wnd(c, conn, 0, NULL);
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index a5f5cfe..afcdec9 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -29,6 +29,7 @@
> >   * @wnd_from_tap:	Last window size from tap, unscaled (as received)
> >   * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
> >   * @seq_to_tap:		Next sequence for packets to tap
> > + * @max_seq_to_tap:	Next seq after highest ever sent. Needeed during retransmit

Sequence numbers regularly wrap around (they are 32 bits), so you
can't really define this. I'm not entirely sure how you use it,
though -- I have the same question about the usage of MAX() above.

> >   * @seq_ack_from_tap:	Last ACK number received from tap
> >   * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
> >   * @seq_ack_to_tap:	Last ACK number sent to tap
> > @@ -100,6 +101,7 @@ struct tcp_tap_conn {
> >  	uint16_t	wnd_to_tap;
> >  
> >  	uint32_t	seq_to_tap;
> > +	uint32_t	max_seq_to_tap;
> >  	uint32_t	seq_ack_from_tap;
> >  	uint32_t	seq_from_tap;
> >  	uint32_t	seq_ack_to_tap;
> 

-- 
Stefano