From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 5DAE35A026E for ; Wed, 24 Apr 2024 03:04:59 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1713920695; bh=Zw4F64tJZU5VVe1yNilPo+Ikj4vzQ0oOJKUCcDoFy5k=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=NH+Q82RrW57SudqrHy7iAHALcngE+jcf4J+7uVoXamNhlemqihIAwnpQhHHsZcOiE 3Cd2YRYnbswDppB/UBVzRWdRXYS8XvhuyPra1g/6vm9GabnuyyNoD8KorhXtWqbL+e usPQCcaZqgvr5dYmb3950bHFLeq3E+uJSJeQfTvCe1OzF1B8s1kNcrCF33JPEV19Sl v4+jSedl2x/Xc/Cl5DH4LmeYGBkMAJI2SO8/udTSpiPXYIAkqf7JEjfJ8E5OgiqEMA I/WWDUwg5LA+Az6+uIgh1G24JJRPq1TBnZToZPkCv9HmhkHKJ8RuUqvVsqeYSWoQWQ E5zzZXcdUJtyg== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4VPLPM3SfDz4wcd; Wed, 24 Apr 2024 11:04:55 +1000 (AEST) Date: Wed, 24 Apr 2024 11:04:51 +1000 From: David Gibson To: Jon Maloy Subject: Re: [PATCH 2/2] tcp: allow retransmit when peer receive window is zero Message-ID: References: <20240420191920.104876-1-jmaloy@redhat.com> <20240420191920.104876-3-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="pnR5HBUcNd22Vnb6" Content-Disposition: inline In-Reply-To: <20240420191920.104876-3-jmaloy@redhat.com> Message-ID-Hash: JTFSM2BIYBUSQ4PV4X3GBFO4LAGANWF2 X-Message-ID-Hash: JTFSM2BIYBUSQ4PV4X3GBFO4LAGANWF2 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --pnR5HBUcNd22Vnb6 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Apr 20, 2024 at 03:19:20PM -0400, Jon Maloy wrote: > A bug in kernel TCP may lead to a deadlock where a zero window is sent > from the peer, while buffer reads doesn't lead to it being updated. > At the same time, the zero window stops this side from sending out > more data to trigger new advertisements to be sent from the peer. This is a bit confusing to me. I guess the buffer reads must be happening on the peer, not "this side", but that's not very clear. I think "received from the peer" might be better than "sent from the peer" to make it more obvious that the point of view is uniformly from "this side". > RFC 793 states that it always is permitted for a sender to send one byte > of data even when the window is zero. This resolves the deadlock described > above, so we choose to introduce it here as a last resort. We allow it bo= th > during fast and as keep-alives when the timer sees no activity on > the connection. Looks like there's a missing noun after "during fast". > However, we notice that this solution doesn=B4t work well. Traffic someti= mes > goes to zero, and onley recovers after the timer has resolved the situati= on. s/onley/only/ > Because of this, we chose to improve it slightly: The deadlock happens wh= en a Is it actually useful to describe the "bad" workaround if we're using a different one? I don't see how the better workaround is an obvious extension of the bad one. > packet has been dropped at the peer end because of memory squeeze. We the= refore > consider it legitimate to retransmit that packet while considering the wi= ndow > size that was valid at the moment it was first transmitted. This works > much better. >=20 > It should be noted that although this solves the problem we have at hand, > it is not a genuine solution to the kernel bug. There may well be TCP sta= cks > around in other OS-es which don't do this probing. >=20 > Signed-off-by: Jon Maloy > --- > tcp.c | 26 ++++++++++++++++---------- > tcp_conn.h | 2 ++ > 2 files changed, 18 insertions(+), 10 deletions(-) >=20 > diff --git a/tcp.c b/tcp.c > index 95d400a..9dea151 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1774,6 +1774,7 @@ static void tcp_seq_init(const struct ctx *c, struc= t tcp_tap_conn *conn, > ns =3D (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; > =20 > conn->seq_to_tap =3D ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; > + conn->max_seq_to_tap =3D conn->seq_to_tap; > } > =20 > /** > @@ -2123,9 +2124,8 @@ static void tcp_data_to_tap(const struct ctx *c, st= ruct tcp_tap_conn *conn, > * > * #syscalls recvmsg > */ > -static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > +static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn, = uint32_t wnd_scaled) > { > - uint32_t wnd_scaled =3D conn->wnd_from_tap << conn->ws_from_tap; > int fill_bufs, send_bufs =3D 0, last_len, iov_rem =3D 0; > int sendlen, len, plen, v4 =3D CONN_V4(conn); > int s =3D conn->sock, i, ret =3D 0; > @@ -2212,6 +2212,9 @@ static int tcp_data_from_sock(struct ctx *c, struct= tcp_tap_conn *conn) > =20 > return 0; > } > + sendlen =3D len; > + if (!peek_offset_cap) > + sendlen -=3D already_sent; > =20 > sendlen =3D len; > if (!peek_offset_cap) Looks like some duplicated lines from a bad rebase. > @@ -2241,7 +2244,9 @@ static int tcp_data_from_sock(struct ctx *c, struct= tcp_tap_conn *conn) > tcp_data_to_tap(c, conn, plen, no_csum, seq); > seq +=3D plen; > } > - > + /* We need this to know this during retransmission: */ > + if (SEQ_GT(seq, conn->max_seq_to_tap)) > + conn->max_seq_to_tap =3D seq; > conn_flag(c, conn, ACK_FROM_TAP_DUE); > =20 > return 0; > @@ -2317,8 +2322,7 @@ static int tcp_data_from_tap(struct ctx *c, struct = tcp_tap_conn *conn, > SEQ_GE(ack_seq, max_ack_seq)) { > /* Fast re-transmit */ > retr =3D !len && !th->fin && > - ack_seq =3D=3D max_ack_seq && > - ntohs(th->window) =3D=3D max_ack_seq_wnd; > + ack_seq =3D=3D max_ack_seq; > =20 > max_ack_seq_wnd =3D ntohs(th->window); > max_ack_seq =3D ack_seq; > @@ -2385,9 +2389,10 @@ static int tcp_data_from_tap(struct ctx *c, struct= tcp_tap_conn *conn, > flow_trace(conn, > "fast re-transmit, ACK: %u, previous sequence: %u", > max_ack_seq, conn->seq_to_tap); > + > conn->seq_ack_from_tap =3D max_ack_seq; > conn->seq_to_tap =3D max_ack_seq; > - tcp_data_from_sock(c, conn); > + tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_ac= k_from_tap)); Does the MAX do anything: if max_seq_to_tap equals seq_ack_from_tap then, by definition, there is nothing to retransmit - everything has been acked. > } > =20 > if (!iov_i) > @@ -2483,7 +2488,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c= , struct tcp_tap_conn *conn, > /* The client might have sent data already, which we didn't > * dequeue waiting for SYN,ACK from tap -- check now. > */ > - tcp_data_from_sock(c, conn); > + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap); > tcp_send_flag(c, conn, ACK); > } > =20 > @@ -2575,7 +2580,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int= af, > =20 > tcp_tap_window_update(conn, ntohs(th->window)); > =20 > - tcp_data_from_sock(c, conn); > + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap); > =20 > if (p->count - idx =3D=3D 1) > return 1; > @@ -2788,7 +2793,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_r= ef ref) > flow_dbg(conn, "ACK timeout, retry"); > conn->retrans++; > conn->seq_to_tap =3D conn->seq_ack_from_tap; > - tcp_data_from_sock(c, conn); > + tcp_data_from_sock(c, conn, MAX(1, conn->max_seq_to_tap - conn->seq_a= ck_from_tap)); > tcp_timer_ctl(c, conn); > } > } else { > @@ -2807,6 +2812,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_r= ef ref) > tcp_rst(c, conn); > } > } > + Spurious change > } > =20 > /** > @@ -2843,7 +2849,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_re= f ref, uint32_t events) > conn_event(c, conn, SOCK_FIN_RCVD); > =20 > if (events & EPOLLIN) > - tcp_data_from_sock(c, conn); > + tcp_data_from_sock(c, conn, conn->wnd_from_tap << conn->ws_from_tap); > =20 > if (events & EPOLLOUT) > tcp_update_seqack_wnd(c, conn, 0, NULL); > diff --git a/tcp_conn.h b/tcp_conn.h > index a5f5cfe..afcdec9 100644 > --- a/tcp_conn.h > +++ b/tcp_conn.h > @@ -29,6 +29,7 @@ > * @wnd_from_tap: Last window size from tap, unscaled (as received) > * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) > * @seq_to_tap: Next sequence for packets to tap > + * @max_seq_to_tap: Next seq after highest ever sent. Needeed during ret= ransmit > * @seq_ack_from_tap: Last ACK number received from tap > * @seq_from_tap: Next sequence for packets from tap (not actually sent) > * @seq_ack_to_tap: Last ACK number sent to tap > @@ -100,6 +101,7 @@ struct tcp_tap_conn { > uint16_t wnd_to_tap; > =20 > uint32_t seq_to_tap; > + uint32_t max_seq_to_tap; > uint32_t seq_ack_from_tap; > uint32_t seq_from_tap; > uint32_t seq_ack_to_tap; --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --pnR5HBUcNd22Vnb6 Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmYoWrIACgkQzQJF27ox 2Gd7bA/8CxQt3cjwi8jLEpLoHStEtVzl5DAsU6ixo7eHAlrCAX7ehvmOYBEzSYzb I2GvJCPzB1tUpB8O27ZkRZxU+UL7ZdTGqFqDQiwg9fAHXwxGzvxfBeOgzuEFNvdc gXseONp2IF5gk16NbQj7WiVbVzIyOXkACvqr6RnmyDvGerdbRhlwT8re2e6/WYPM vt35WtIv+V/j/naEqX0u6zQzfDHpfFdppG0dxzS91rVqYa0TJ0UJ9AuISRuaS+aV CsB1/YVBjPhmzBExWlTYdsIXG6sz2oXViFI/MS8B/rj3aLvTxGiQwtjCf3rii4fI AorLvQgoJtQ6wqzv/MLi22fa+d8NIL/b+Q6k0EU71SesQCzLVwC1opzoCKVTEk/8 GIBVjpKIMHtOwrK2lflMokk9GaB5/pEsJoSt0228rN1E337ZjMYYq03pj/N+S+Yq 8hZO5nFc+sAYhvMu1zgEx/0QpzH6yP144S/ZLBRPIsR0ZUUvOQM/+ceGPLWYin7c FXywGgF40uZ9qj01ERS5yqeUuAofnb5fbdNR7sDlJ2EuGCDIAppu0Z5MXRw1Da4j hEqeF8O8nHteGAthbBP4zGAjr7fxU5lxh98isNQFf/6nO0R7E47ny5VN+8V31yLx wY2HkVQe7aS2FPhmWZRrkzgKSD0MsZ91/+xRgbJ8r6BQDgajOfs= =Jdy1 -----END PGP SIGNATURE----- --pnR5HBUcNd22Vnb6--