From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 24F715A026D for ; Fri, 10 Nov 2023 06:20:26 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1699593624; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=dt3j0ctoAcNWLIErOySYt3WpqUF65THxE4mqvc2H1tE=; b=OqJXdys0oYuJS0HOEsixSk8FMmJjG7egq519vGuocRLT2VQUxf4FcU/ZIZWoQdelfEICF4 yv3oLBOEVsrJ7x3N7mAsxBG5BPizyKR5+TV3bTt32uqxmtgz/7HD9SVbEywjAEqqkDwHoz yt4fIH6CPjqXbj0gFjD+Y0jLdVBqdRk= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-211-jqBeySgNPeOwfaqAWlF5YA-1; Fri, 10 Nov 2023 00:20:23 -0500 X-MC-Unique: jqBeySgNPeOwfaqAWlF5YA-1 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id CEDE285A58B; Fri, 10 Nov 2023 05:20:22 +0000 (UTC) Received: from elisabeth (unknown [10.39.208.5]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 1E4362026D68; Fri, 10 Nov 2023 05:20:21 +0000 (UTC) Date: Fri, 10 Nov 2023 06:20:08 +0100 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH 2/2] tcp: Don't use TCP_WINDOW_CLAMP Message-ID: <20231110062008.00355dee@elisabeth> In-Reply-To: References: <20231109095400.507679-1-david@gibson.dropbear.id.au> <20231109095400.507679-3-david@gibson.dropbear.id.au> <20231109154836.2d866c69@elisabeth> Organization: Red Hat MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.4 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: XMXRJWAUR2XTAPUMLJOZBEVCURSIVHBR X-Message-ID-Hash: XMXRJWAUR2XTAPUMLJOZBEVCURSIVHBR X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, 10 Nov 2023 11:23:03 +1100 David Gibson wrote: > On Thu, Nov 09, 2023 at 03:51:36PM +0100, Stefano Brivio wrote: > > On Thu, 9 Nov 2023 20:54:00 +1100 > > David Gibson wrote: > > > > > On the L2 tap side, we see TCP headers and know the TCP window that the > > > ultimate receiver is advertising. In order to avoid unnecessary buffering > > > within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt > > > to advertise that window back to the original sock-side sender using > > > TCP_WINDOW_CLAMP. > > > > > > However, TCP_WINDOW_CLAMP just doesn't work like this. Prior to kernel > > > commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply > > > had no effect on established sockets. After that commit, it does affect > > > established sockets but doesn't behave the way we need: > > > * It appears to be designed only to shrink the window, not to allow it to > > > re-expand. > > > * More importantly, that commit has a serious bug where if the > > > setsockopt() is made when the existing kernel advertised window for the > > > socket happens to be zero, it will now become locked at zero, stopping > > > any further data from being received on the socket. > > > > > > Since this has never worked as intended, simply remove it. It might be > > > possible to re-implement the intended behaviour by manipulating SO_RCVBUF, > > > so we leave a comment to that effect. > > > > > > This kernel bug is the underlying cause of both the linked passt bug and > > > the linked podman bug. We attempted to fix this before with passt commit > > > d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag"). > > > However while that commit masked the bug for some cases, it didn't really > > > address the problem. > > > > > > Fixes: d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag") > > > Link: https://github.com/containers/podman/issues/20170 > > > Link: https://bugs.passt.top/show_bug.cgi?id=74 > > > > > > Signed-off-by: David Gibson > > > --- > > > tcp.c | 65 ++++++++---------------------------------------------- > > > tcp_conn.h | 7 +++--- > > > 2 files changed, 12 insertions(+), 60 deletions(-) > > > > > > diff --git a/tcp.c b/tcp.c > > > index 13e82ca..cfcd40a 100644 > > > --- a/tcp.c > > > +++ b/tcp.c > > > @@ -46,8 +46,8 @@ > > > * - avoiding port and address translations whenever possible > > > * - mirroring TCP dynamics by observation of socket parameters (TCP_INFO > > > * socket option) and TCP headers of packets coming from the tap interface, > > > - * reapplying those parameters in both flow directions (including TCP_MSS, > > > - * TCP_WINDOW_CLAMP socket options) > > > + * reapplying those parameters in both flow directions (including TCP_MSS > > > + * socket option) > > > * - simplicity: only a small subset of TCP logic is implemented here and > > > * delegated as much as possible to the TCP implementations of guest and host > > > * kernel. This is achieved by: > > > @@ -236,12 +236,10 @@ > > > * - update @seq_ack_from_tap from ack_seq in header > > > * - on two duplicated ACKs, reset @seq_to_tap to @seq_ack_from_tap, and > > > * resend with steps listed above > > > - * - set TCP_WINDOW_CLAMP from TCP header from tap > > > * > > > * - from tap/guest to socket: > > > * - on packet from tap/guest: > > > * - set @ts_tap_act > > > - * - set TCP_WINDOW_CLAMP from TCP header from tap > > > * - check seq from header against @seq_from_tap, if data is missing, send > > > * two ACKs with number @seq_ack_to_tap, discard packet > > > * - otherwise queue data to socket, set @seq_from_tap to seq from header > > > @@ -420,7 +418,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = { > > > }; > > > > > > static const char *tcp_flag_str[] __attribute((__unused__)) = { > > > - "STALLED", "LOCAL", "WND_CLAMPED", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", > > > + "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", > > > "ACK_FROM_TAP_DUE", > > > }; > > > > > > @@ -1790,58 +1788,16 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, > > > > > > /** > > > * tcp_tap_window_update() - Process an updated window from tap side > > > - * @c: Execution context > > > * @conn: Connection pointer > > > * @window: Window value, host order, unscaled > > > */ > > > -static void tcp_tap_window_update(const struct ctx *c, > > > - struct tcp_tap_conn *conn, unsigned wnd) > > > +static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) > > > { > > > - uint32_t prev_scaled = conn->wnd_from_tap << conn->ws_from_tap; > > > - int s = conn->sock; > > > - > > > wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); > > > conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX); > > > > > > - /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up > > > - * with a zero-sized window on a TCP socket, dropping data (once > > > - * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear > > > - * to be enough to make the kernel advertise a non-zero window to the > > > - * sender. Forcing a TCP_WINDOW_CLAMP setting, even with the existing > > > - * value, fixes this. > > > - * > > > - * The STALLED flag on a connection is a sufficient indication that we > > > - * might have a zero-sized window on the socket, because it's set if we > > > - * exhausted the tap-side window, or if everything we receive from a > > > - * socket is already in flight to the guest. > > > - * > > > - * So, if STALLED is set, and we received a window value from the tap, > > > - * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated > > > - * further and fixed in the kernel instead, if confirmed. > > > - */ > > > - if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) { > > > - if (prev_scaled == wnd) > > > - return; > > > - > > > - /* Discard +/- 1% updates to spare some syscalls. */ > > > - /* TODO: cppcheck, starting from commit b4d455df487c ("Fix > > > - * 11349: FP negativeIndex for clamped array index (#4627)"), > > > - * reports wnd > prev_scaled as always being true, see also: > > > - * > > > - * https://github.com/danmar/cppcheck/pull/4627 > > > - * > > > - * drop this suppression once that's resolved. > > > - */ > > > - /* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression] */ > > > - if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) || > > > - (wnd < prev_scaled && wnd * 101 / 100 > prev_scaled)) > > > - return; > > > - } > > > - > > > - if (setsockopt(s, SOL_TCP, TCP_WINDOW_CLAMP, &wnd, sizeof(wnd))) > > > - trace("TCP: failed to set TCP_WINDOW_CLAMP on socket %i", s); > > > - > > > - conn_flag(c, conn, WND_CLAMPED); > > > + /* FIXME: reflect the tap-side receiver's window back to the sock-side > > > + * sender by adjusting SO_RCVBUF? */ > > > > If we adjust SO_RCVBUF, we lose the chance to get bigger receive > > buffers by means of auto-tuning, which is especially problematic with > > default values for rmem_max (in turn, the case which is giving us > > issues here). > > Yeah, I forgot about that :/. > > > I'd rather fix TCP_CLAMP_WINDOW in the kernel and then re-enable this > > in passt (I guess they'll be no obvious way to probe this, so we'll > > have to rely on the kernel version). > > I kind of hate checking kernel versions, because it means we can't > make use of fixes backported to distro kernels for example. > > Now that we understand the cause of the problem, I can see a way to > probe for it, but it's pretty ugly: > > 1. Create a pair of connected local TCP sockets A and B > 2. Set SO_SNDBUF on A and SO_RCVBUF on B as small as we're allowed > 3. Write into A until EAGAIN; since we didn't read B, we can now be > confident the window is zero > 4. Set TCP_CLAMP_WINDOW on B to a fairly large > 5. Read and discard from B until EAGAIN, we can now be confident > the buffer is empty > 6. Write to A: if we get EGAIN it looks like we're hitting this > problem I was also thinking of something similar, but if the fix involves not setting rcv_ssthresh directly (that's my current thought), then we can just open a TCP socket, read rcv_ssthresh via TCP_INFO, set a different value via TCP_WINDOW_CLAMP, check that rcv_ssthresh matches that value. -- Stefano