public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 2/2] tcp: Don't use TCP_WINDOW_CLAMP
Date: Fri, 10 Nov 2023 06:20:08 +0100	[thread overview]
Message-ID: <20231110062008.00355dee@elisabeth> (raw)
In-Reply-To: <ZU1356dn09rBwUop@zatzit>

On Fri, 10 Nov 2023 11:23:03 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Nov 09, 2023 at 03:51:36PM +0100, Stefano Brivio wrote:
> > On Thu,  9 Nov 2023 20:54:00 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On the L2 tap side, we see TCP headers and know the TCP window that the
> > > ultimate receiver is advertising.  In order to avoid unnecessary buffering
> > > within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt
> > > to advertise that window back to the original sock-side sender using
> > > TCP_WINDOW_CLAMP.
> > > 
> > > However, TCP_WINDOW_CLAMP just doesn't work like this.  Prior to kernel
> > > commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply
> > > had no effect on established sockets.  After that commit, it does affect
> > > established sockets but doesn't behave the way we need:
> > >   * It appears to be designed only to shrink the window, not to allow it to
> > >     re-expand.
> > >   * More importantly, that commit has a serious bug where if the
> > >     setsockopt() is made when the existing kernel advertised window for the
> > >     socket happens to be zero, it will now become locked at zero, stopping
> > >     any further data from being received on the socket.
> > > 
> > > Since this has never worked as intended, simply remove it.  It might be
> > > possible to re-implement the intended behaviour by manipulating SO_RCVBUF,
> > > so we leave a comment to that effect.
> > > 
> > > This kernel bug is the underlying cause of both the linked passt bug and
> > > the linked podman bug.  We attempted to fix this before with passt commit
> > > d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag").
> > > However while that commit masked the bug for some cases, it didn't really
> > > address the problem.
> > > 
> > > Fixes: d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag")
> > > Link: https://github.com/containers/podman/issues/20170
> > > Link: https://bugs.passt.top/show_bug.cgi?id=74
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  tcp.c      | 65 ++++++++----------------------------------------------
> > >  tcp_conn.h |  7 +++---
> > >  2 files changed, 12 insertions(+), 60 deletions(-)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index 13e82ca..cfcd40a 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -46,8 +46,8 @@
> > >   *   - avoiding port and address translations whenever possible
> > >   *   - mirroring TCP dynamics by observation of socket parameters (TCP_INFO
> > >   *     socket option) and TCP headers of packets coming from the tap interface,
> > > - *     reapplying those parameters in both flow directions (including TCP_MSS,
> > > - *     TCP_WINDOW_CLAMP socket options)
> > > + *     reapplying those parameters in both flow directions (including TCP_MSS
> > > + *     socket option)
> > >   * - simplicity: only a small subset of TCP logic is implemented here and
> > >   *   delegated as much as possible to the TCP implementations of guest and host
> > >   *   kernel. This is achieved by:
> > > @@ -236,12 +236,10 @@
> > >   *     - update @seq_ack_from_tap from ack_seq in header
> > >   *     - on two duplicated ACKs, reset @seq_to_tap to @seq_ack_from_tap, and
> > >   *       resend with steps listed above
> > > - *     - set TCP_WINDOW_CLAMP from TCP header from tap
> > >   *
> > >   * - from tap/guest to socket:
> > >   *   - on packet from tap/guest:
> > >   *     - set @ts_tap_act
> > > - *     - set TCP_WINDOW_CLAMP from TCP header from tap
> > >   *     - check seq from header against @seq_from_tap, if data is missing, send
> > >   *       two ACKs with number @seq_ack_to_tap, discard packet
> > >   *     - otherwise queue data to socket, set @seq_from_tap to seq from header
> > > @@ -420,7 +418,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
> > >  };
> > >  
> > >  static const char *tcp_flag_str[] __attribute((__unused__)) = {
> > > -	"STALLED", "LOCAL", "WND_CLAMPED", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
> > > +	"STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
> > >  	"ACK_FROM_TAP_DUE",
> > >  };
> > >  
> > > @@ -1790,58 +1788,16 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
> > >  
> > >  /**
> > >   * tcp_tap_window_update() - Process an updated window from tap side
> > > - * @c:		Execution context
> > >   * @conn:	Connection pointer
> > >   * @window:	Window value, host order, unscaled
> > >   */
> > > -static void tcp_tap_window_update(const struct ctx *c,
> > > -				  struct tcp_tap_conn *conn, unsigned wnd)
> > > +static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> > >  {
> > > -	uint32_t prev_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> > > -	int s = conn->sock;
> > > -
> > >  	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> > >  	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
> > >  
> > > -	/* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
> > > -	 * with a zero-sized window on a TCP socket, dropping data (once
> > > -	 * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
> > > -	 * to be enough to make the kernel advertise a non-zero window to the
> > > -	 * sender. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
> > > -	 * value, fixes this.
> > > -	 *
> > > -	 * The STALLED flag on a connection is a sufficient indication that we
> > > -	 * might have a zero-sized window on the socket, because it's set if we
> > > -	 * exhausted the tap-side window, or if everything we receive from a
> > > -	 * socket is already in flight to the guest.
> > > -	 *
> > > -	 * So, if STALLED is set, and we received a window value from the tap,
> > > -	 * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
> > > -	 * further and fixed in the kernel instead, if confirmed.
> > > -	 */
> > > -	if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
> > > -		if (prev_scaled == wnd)
> > > -			return;
> > > -
> > > -		/* Discard +/- 1% updates to spare some syscalls. */
> > > -		/* TODO: cppcheck, starting from commit b4d455df487c ("Fix
> > > -		 * 11349: FP negativeIndex for clamped array index (#4627)"),
> > > -		 * reports wnd > prev_scaled as always being true, see also:
> > > -		 *
> > > -		 *	https://github.com/danmar/cppcheck/pull/4627
> > > -		 *
> > > -		 * drop this suppression once that's resolved.
> > > -		 */
> > > -		/* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression] */
> > > -		if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) ||
> > > -		    (wnd < prev_scaled && wnd * 101 / 100 > prev_scaled))
> > > -			return;
> > > -	}
> > > -
> > > -	if (setsockopt(s, SOL_TCP, TCP_WINDOW_CLAMP, &wnd, sizeof(wnd)))
> > > -		trace("TCP: failed to set TCP_WINDOW_CLAMP on socket %i", s);
> > > -
> > > -	conn_flag(c, conn, WND_CLAMPED);
> > > +	/* FIXME: reflect the tap-side receiver's window back to the sock-side
> > > +	 * sender by adjusting SO_RCVBUF? */  
> > 
> > If we adjust SO_RCVBUF, we lose the chance to get bigger receive
> > buffers by means of auto-tuning, which is especially problematic with
> > default values for rmem_max (in turn, the case which is giving us
> > issues here).  
> 
> Yeah, I forgot about that :/.
> 
> > I'd rather fix TCP_CLAMP_WINDOW in the kernel and then re-enable this
> > in passt (I guess they'll be no obvious way to probe this, so we'll
> > have to rely on the kernel version).  
> 
> I kind of hate checking kernel versions, because it means we can't
> make use of fixes backported to distro kernels for example.
> 
> Now that we understand the cause of the problem, I can see a way to
> probe for it, but it's pretty ugly:
> 
>    1. Create a pair of connected local TCP sockets A and B
>    2. Set SO_SNDBUF on A and SO_RCVBUF on B as small as we're allowed
>    3. Write into A until EAGAIN; since we didn't read B, we can now be
>       confident the window is zero
>    4. Set TCP_CLAMP_WINDOW on B to a fairly large
>    5. Read and discard from B until EAGAIN, we can now be confident
>       the buffer is empty
>    6. Write to A: if we get EGAIN it looks like we're hitting this
>       problem

I was also thinking of something similar, but if the fix involves not
setting rcv_ssthresh directly (that's my current thought), then we can
just open a TCP socket, read rcv_ssthresh via TCP_INFO, set a different
value via TCP_WINDOW_CLAMP, check that rcv_ssthresh matches that value.

-- 
Stefano


  reply	other threads:[~2023-11-10  5:20 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-09  9:53 [PATCH 0/2] Avoid bugs related to TCP_WINDOW_CLAMP David Gibson
2023-11-09  9:53 ` [PATCH 1/2] tcp: Rename and small cleanup to tcp_clamp_window() David Gibson
2023-11-09  9:54 ` [PATCH 2/2] tcp: Don't use TCP_WINDOW_CLAMP David Gibson
2023-11-09 14:51   ` Stefano Brivio
2023-11-10  0:23     ` David Gibson
2023-11-10  5:20       ` Stefano Brivio [this message]
2023-11-10 17:06 ` [PATCH 0/2] Avoid bugs related to TCP_WINDOW_CLAMP Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231110062008.00355dee@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).