Date: Tue, 21 May 2024 15:51:58 +1000
From: David Gibson
To: Jon Maloy
Cc: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH v6 3/3] tcp: allow retransmit when peer receive window is zero
References: <20240517152414.1188282-1-jmaloy@redhat.com> <20240517152414.1188282-4-jmaloy@redhat.com>
In-Reply-To: <20240517152414.1188282-4-jmaloy@redhat.com>

On Fri, May 17, 2024 at 11:24:14AM -0400, Jon Maloy wrote:
> A bug in kernel TCP may lead to a deadlock where a zero window is sent
> from the peer, while it is unable to send out window updates even after
> reads have freed up enough buffer space to permit a larger window.
> In this situation, new window advertisements from the peer can only be
> triggered by packets arriving from this side.
>
> However, such packets are never sent, because the zero-window condition
> currently prevents this side from sending out any packets whatsoever
> to the peer.
>
> We notice that the above bug is triggered *only* after the peer has
> dropped an arriving packet because of severe memory squeeze, and that we
> hence always enter a retransmission situation when this occurs. This
> also means that it goes against the RFC 9293 recommendation that a
> previously advertised window never should shrink.
>
> RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find
> the following statement:
> "A TCP receiver SHOULD NOT shrink the window, i.e., move the right
> window edge to the left (SHLD-14). However, a sending TCP peer MUST
> be robust against window shrinking, which may cause the
> "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).
>
> If this happens, the sender SHOULD NOT send new data (SHLD-15), but
> SHOULD retransmit normally the old unacknowledged data between SND.UNA
> and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data
> beyond SND.UNA+SND.WND (MAY-7)"

So... I'm beginning to think this section of the RFC isn't really
helpful or useful here.  For starters, it doesn't seem to cover all of
what we're trying to do here - particularly the fact that we try to
send keepalive probes when in this situation...

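[Illustration only, not part of the patch: the "usable window" from the
RFC text quoted above is defined in RFC 9293, section 3.8.6.2.1, as
U = SND.UNA + SND.WND - SND.NXT.  A minimal sketch of that arithmetic
follows.  struct snd_state and usable_window() are hypothetical names
chosen to mirror the RFC variables; nothing like them is claimed to
exist in passt.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sender state, named after the RFC 9293 variables */
struct snd_state {
	uint32_t una;	/* SND.UNA: oldest unacknowledged sequence number */
	uint32_t nxt;	/* SND.NXT: next sequence number to be sent */
	uint32_t wnd;	/* SND.WND: most recently advertised window */
};

/* Usable window, U = SND.UNA + SND.WND - SND.NXT.  The unsigned sum wraps
 * modulo 2^32; converting the result to a signed type (two's complement
 * assumed) lets a shrunken window show up as a negative value.
 */
static int32_t usable_window(const struct snd_state *s)
{
	return (int32_t)(s->una + s->wnd - s->nxt);
}

/* 3000 bytes are in flight, then the peer advertises a zero window */
static void usable_window_example(void)
{
	struct snd_state s = { .una = 1000, .nxt = 4000, .wnd = 8000 };

	assert(usable_window(&s) == 5000);	/* room for new data */

	s.wnd = 0;				/* window shrinks to zero */
	assert(usable_window(&s) == -3000);	/* negative: no new data */
}
```

[With the most recently advertised (zero) SND.WND the value goes
negative, which is the case MUST-34 talks about; computed from an
earlier, larger SND.WND it would stay positive.]
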
> We never see the window become negative, but we interpret this as a
> recommendation to use the previously available window during
> retransmission even when the currently advertised window is zero.

... but also, looking at the RFC, I'm really not convinced of this
interpretation.  SND.WND generally refers to the last window we've
seen advertised by the guest, and I don't see any indication that in
this specific case we should instead consider the previous version it
had.  Indeed the "usable window" value it's discussing is elsewhere
described in terms of SND.WND, and if we used the previous SND.WND
value it would *not* become negative.

I believe that last MAY-7 bit means we're not violating the RFC by
using the previous window edge, but I don't think there's anything
there to suggest we must or should be doing so.

[In fact, I wonder if the reason behind MAY-7 is that it allows an
implementation to satisfy this by just ignoring window updates which
would move the right edge backwards]

So... moving on from the RFC to what we actually need to do to work
around this bug.  Do we actually need anything more than continuing to
send keep-alive probes even when the window is zero?

> We use the above mechanism only for timer-induced retransmits, while
> the fast-retransmit mechanism won't trigger on this condition.
>
> It should be noted that although this solves the problem we have at
> hand, it is not a genuine solution to the kernel bug. There may well
> be TCP stacks around in other OS-es which don't do this, nor have
> keep-alive probing as an alternative way to solve the situation.
>
> Signed-off-by: Jon Maloy
>
> ---
> v2: - Using previously advertised window during retransmission, instead
>       highest send sequence number in the cycle.
> v3: - Rebased to newest code
>     - Changes based on feedback from PASST team
>     - Sending out empty probe message at timer expiration when
>       we are not in retransmit situation.
> v4: - Some small changes based on feedback from PASST team.
>     - Replaced fast retransmit with a one-time 'fast probe' when
>       window is zero.
> v5: - Gave up on 'fast probing' for now. When I got the sequence
>       numbers right in the flag message (after having emptied the tap
>       queue), it turns out an empty message does *not* force a new peer
>       window update as was my previous understanding of the code.
>     - Added cppcheck suppression line (which I was unable to verify)
>       as suggested by S. Brivio.
>     - Removed sending of empty probe when window from tap side is zero.
>       It looks pointless at the moment, at least for solving the above
>       described situation.
> v6: - Ensure that arrival of new data doesn't cause us to ignore a
>       zero-window situation.
>     - Removed the pointless probing referred to in v5 comment.
> ---
>  tcp.c      | 26 ++++++++++++++++++++------
>  tcp_conn.h |  2 ++
>  2 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index fa13292..38c3480 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1764,9 +1764,17 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
>   */
>  static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
>  {
> +	uint32_t wnd_edge;
> +
>  	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> +	/* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression] */

If I recall from earlier, we thought this suppression was needed
because of the cppcheck bug referenced in tcp_update_seqack_wnd().  If
that's the case we need something like that comment here as well:
knownConditionTrueFalse is not a check we should be suppressing
lightly.

But also... is it actually that bug?  In that case the check tripped
when we did an if based on the result of the MIN - it thought it was
always zero.  But here the suppression is on the MIN itself, which
suggests something different.  Is it instead that cppcheck is managing
to deduce that wnd >> conn->ws_from_tap cannot be greater than
USHRT_MAX?
Which should indeed be the case, although I can't quickly see how
you'd statically deduce it.  I'm also not sure why this is showing up
now, because these lines aren't changed.

> +

I also don't think inserting a blank line between the suppression and
the line where the error is occurring is a good idea.

>  	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
>  
> +	wnd_edge = conn->seq_ack_from_tap + wnd;
> +	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
> +		conn->seq_wnd_edge_from_tap = wnd_edge;
> +
>  	/* FIXME: reflect the tap-side receiver's window back to the sock-side
>  	 * sender by adjusting SO_RCVBUF? */
>  }
> @@ -1799,6 +1807,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
>  	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
>  
>  	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
> +	conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
>  }
>  
>  /**
> @@ -2208,13 +2217,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>   */
>  static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  {
> -	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
>  	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
>  	int sendlen, len, dlen, v4 = CONN_V4(conn);
> +	uint32_t already_sent, max_send, seq;
>  	int s = conn->sock, i, ret = 0;
>  	struct msghdr mh_sock = { 0 };
>  	uint16_t mss = MSS_GET(conn);
> -	uint32_t already_sent, seq;
>  	struct iovec *iov;
>  
>  	/* How much have we read/sent since last received ack ? */
> @@ -2228,19 +2236,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  		tcp_set_peek_offset(s, 0);
>  	}
>  
> -	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +	/* How much are we still allowed to send within current window ? */
> +	max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap;
> +	if (SEQ_LE(max_send, 0)) {

Although the maths probably works out correctly, I dislike using
SEQ_LE on sequence differences here, rather than using SEQ_LE directly
on seq_wnd_edge_from_tap and seq_to_tap.

> +		flow_trace(conn, "Window full: right edge: %u, sent: %u",
> +			   conn->seq_wnd_edge_from_tap, conn->seq_to_tap);
> +		conn->seq_wnd_edge_from_tap = conn->seq_to_tap;

So, here we pull seq_wnd_edge_from_tap back in line with seq_to_tap.
Which might be before even the "current" window of seq_ack_to_tap +
wnd_scaled.  Which means there's a pretty brief window in which
seq_wnd_edge_from_tap will actually be beyond the latest window.  It's
not clear to me why that brief window is important - or why getting
more data from the socket side would be relevant to finishing that
window.

>  		conn_flag(c, conn, STALLED);
>  		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  		return 0;
>  	}
>  
>  	/* Set up buffer descriptors we'll fill completely and partially. */
> -	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
> +	fill_bufs = DIV_ROUND_UP(max_send, mss);
>  	if (fill_bufs > TCP_FRAMES) {
>  		fill_bufs = TCP_FRAMES;
>  		iov_rem = 0;
>  	} else {
> -		iov_rem = (wnd_scaled - already_sent) % mss;
> +		iov_rem = max_send % mss;
>  	}
>  
>  	/* Prepare iov according to kernel capability */
> @@ -2347,6 +2360,7 @@ err:
>   *
>   * Return: count of consumed packets
>   */
> +

Spurious whitespace change.

>  static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			     const struct pool *p, int idx)
>  {
> @@ -2950,7 +2964,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
>  	if (events & (EPOLLRDHUP | EPOLLHUP))
>  		conn_event(c, conn, SOCK_FIN_RCVD);
>  
> -	if (events & EPOLLIN)
> +	if (events & EPOLLIN && conn->wnd_from_tap)

Hrm.
If we don't even enter tcp_data_from_sock() when there's no window,
doesn't that mean we won't hit the handling for the max_send < 0
case, we won't set STALLED, won't switch the epoll flags for the
socket to edge triggered mode and will therefore just busy loop on
EPOLLIN socket events until the window re-opens?

>  		tcp_data_from_sock(c, conn);
>  
>  	if (events & EPOLLOUT)
> diff --git a/tcp_conn.h b/tcp_conn.h
> index d280b22..5cbad2a 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -30,6 +30,7 @@
>   * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
>   * @seq_to_tap:		Next sequence for packets to tap
>   * @seq_ack_from_tap:	Last ACK number received from tap
> + * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap
>   * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
>   * @seq_ack_to_tap:	Last ACK number sent to tap
>   * @seq_init_from_tap:	Initial sequence number from tap
> @@ -101,6 +102,7 @@ struct tcp_tap_conn {
>  
>  	uint32_t	seq_to_tap;
>  	uint32_t	seq_ack_from_tap;
> +	uint32_t	seq_wnd_edge_from_tap;
>  	uint32_t	seq_from_tap;
>  	uint32_t	seq_ack_to_tap;
>  	uint32_t	seq_init_from_tap;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
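
[Illustration only, relating to the SEQ_LE() remark in the review
above: a sketch of a "window full" check that compares the two sequence
numbers directly instead of comparing their difference against zero.
seq_le() and window_full() are hypothetical helpers using the common
signed-difference idiom for 32-bit sequence numbers; this is not
passt's SEQ_LE() macro, whose definition is not shown in this thread.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if sequence number a is at or before b, modulo
 * 2^32.  Valid as long as the two are less than 2^31 apart (two's complement
 * conversion assumed).
 */
static bool seq_le(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) <= 0;
}

/* True if no room is left between the next sequence to send and the right
 * window edge.  Both arguments are sequence numbers, not differences.
 */
static bool window_full(uint32_t wnd_edge, uint32_t seq_to_send)
{
	return seq_le(wnd_edge, seq_to_send);
}

static void window_full_example(void)
{
	/* Edge ahead of what has been sent: 500 bytes of room left */
	assert(!window_full(1500, 1000));

	/* Edge equal to, or behind, what has been sent: window is full */
	assert(window_full(1000, 1000));
	assert(window_full(900, 1000));

	/* Still correct when the edge has wrapped past 2^32 */
	assert(!window_full(0x00000010u, 0xfffffff0u));
}
```

[Passing the edge and the send sequence as they are keeps both
operands in sequence space, so the comparison does not depend on how a
difference happens to be interpreted.]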