From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 2/2] tcp: Don't use TCP_WINDOW_CLAMP
Date: Fri, 10 Nov 2023 06:20:08 +0100 [thread overview]
Message-ID: <20231110062008.00355dee@elisabeth> (raw)
In-Reply-To: <ZU1356dn09rBwUop@zatzit>
On Fri, 10 Nov 2023 11:23:03 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Thu, Nov 09, 2023 at 03:51:36PM +0100, Stefano Brivio wrote:
> > On Thu, 9 Nov 2023 20:54:00 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On the L2 tap side, we see TCP headers and know the TCP window that the
> > > ultimate receiver is advertising. In order to avoid unnecessary buffering
> > > within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt
> > > to advertise that window back to the original sock-side sender using
> > > TCP_WINDOW_CLAMP.
> > >
> > > However, TCP_WINDOW_CLAMP just doesn't work like this. Prior to kernel
> > > commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply
> > > had no effect on established sockets. After that commit, it does affect
> > > established sockets but doesn't behave the way we need:
> > > * It appears to be designed only to shrink the window, not to allow it to
> > > re-expand.
> > > * More importantly, that commit has a serious bug where if the
> > > setsockopt() is made when the existing kernel advertised window for the
> > > socket happens to be zero, it will now become locked at zero, stopping
> > > any further data from being received on the socket.
> > >
> > > Since this has never worked as intended, simply remove it. It might be
> > > possible to re-implement the intended behaviour by manipulating SO_RCVBUF,
> > > so we leave a comment to that effect.
> > >
> > > This kernel bug is the underlying cause of both the linked passt bug and
> > > the linked podman bug. We attempted to fix this before with passt commit
> > > d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag").
> > > However while that commit masked the bug for some cases, it didn't really
> > > address the problem.
> > >
> > > Fixes: d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag")
> > > Link: https://github.com/containers/podman/issues/20170
> > > Link: https://bugs.passt.top/show_bug.cgi?id=74
> > >
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > > tcp.c | 65 ++++++++----------------------------------------------
> > > tcp_conn.h | 7 +++---
> > > 2 files changed, 12 insertions(+), 60 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 13e82ca..cfcd40a 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -46,8 +46,8 @@
> > > * - avoiding port and address translations whenever possible
> > > * - mirroring TCP dynamics by observation of socket parameters (TCP_INFO
> > > * socket option) and TCP headers of packets coming from the tap interface,
> > > - * reapplying those parameters in both flow directions (including TCP_MSS,
> > > - * TCP_WINDOW_CLAMP socket options)
> > > + * reapplying those parameters in both flow directions (including TCP_MSS
> > > + * socket option)
> > > * - simplicity: only a small subset of TCP logic is implemented here and
> > > * delegated as much as possible to the TCP implementations of guest and host
> > > * kernel. This is achieved by:
> > > @@ -236,12 +236,10 @@
> > > * - update @seq_ack_from_tap from ack_seq in header
> > > * - on two duplicated ACKs, reset @seq_to_tap to @seq_ack_from_tap, and
> > > * resend with steps listed above
> > > - * - set TCP_WINDOW_CLAMP from TCP header from tap
> > > *
> > > * - from tap/guest to socket:
> > > * - on packet from tap/guest:
> > > * - set @ts_tap_act
> > > - * - set TCP_WINDOW_CLAMP from TCP header from tap
> > > * - check seq from header against @seq_from_tap, if data is missing, send
> > > * two ACKs with number @seq_ack_to_tap, discard packet
> > > * - otherwise queue data to socket, set @seq_from_tap to seq from header
> > > @@ -420,7 +418,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
> > > };
> > >
> > > static const char *tcp_flag_str[] __attribute((__unused__)) = {
> > > - "STALLED", "LOCAL", "WND_CLAMPED", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
> > > + "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
> > > "ACK_FROM_TAP_DUE",
> > > };
> > >
> > > @@ -1790,58 +1788,16 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
> > >
> > > /**
> > > * tcp_tap_window_update() - Process an updated window from tap side
> > > - * @c: Execution context
> > > * @conn: Connection pointer
> > > * @window: Window value, host order, unscaled
> > > */
> > > -static void tcp_tap_window_update(const struct ctx *c,
> > > - struct tcp_tap_conn *conn, unsigned wnd)
> > > +static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> > > {
> > > - uint32_t prev_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> > > - int s = conn->sock;
> > > -
> > > wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
> > > conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
> > >
> > > - /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
> > > - * with a zero-sized window on a TCP socket, dropping data (once
> > > - * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
> > > - * to be enough to make the kernel advertise a non-zero window to the
> > > - * sender. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
> > > - * value, fixes this.
> > > - *
> > > - * The STALLED flag on a connection is a sufficient indication that we
> > > - * might have a zero-sized window on the socket, because it's set if we
> > > - * exhausted the tap-side window, or if everything we receive from a
> > > - * socket is already in flight to the guest.
> > > - *
> > > - * So, if STALLED is set, and we received a window value from the tap,
> > > - * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
> > > - * further and fixed in the kernel instead, if confirmed.
> > > - */
> > > - if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
> > > - if (prev_scaled == wnd)
> > > - return;
> > > -
> > > - /* Discard +/- 1% updates to spare some syscalls. */
> > > - /* TODO: cppcheck, starting from commit b4d455df487c ("Fix
> > > - * 11349: FP negativeIndex for clamped array index (#4627)"),
> > > - * reports wnd > prev_scaled as always being true, see also:
> > > - *
> > > - * https://github.com/danmar/cppcheck/pull/4627
> > > - *
> > > - * drop this suppression once that's resolved.
> > > - */
> > > - /* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression] */
> > > - if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) ||
> > > - (wnd < prev_scaled && wnd * 101 / 100 > prev_scaled))
> > > - return;
> > > - }
> > > -
> > > - if (setsockopt(s, SOL_TCP, TCP_WINDOW_CLAMP, &wnd, sizeof(wnd)))
> > > - trace("TCP: failed to set TCP_WINDOW_CLAMP on socket %i", s);
> > > -
> > > - conn_flag(c, conn, WND_CLAMPED);
> > > + /* FIXME: reflect the tap-side receiver's window back to the sock-side
> > > + * sender by adjusting SO_RCVBUF? */
> >
> > If we adjust SO_RCVBUF, we lose the chance to get bigger receive
> > buffers by means of auto-tuning, which is especially problematic with
> > default values for rmem_max (in turn, the case which is giving us
> > issues here).
>
> Yeah, I forgot about that :/.
>
> > I'd rather fix TCP_CLAMP_WINDOW in the kernel and then re-enable this
> > in passt (I guess they'll be no obvious way to probe this, so we'll
> > have to rely on the kernel version).
>
> I kind of hate checking kernel versions, because it means we can't
> make use of fixes backported to distro kernels for example.
>
> Now that we understand the cause of the problem, I can see a way to
> probe for it, but it's pretty ugly:
>
> 1. Create a pair of connected local TCP sockets A and B
> 2. Set SO_SNDBUF on A and SO_RCVBUF on B as small as we're allowed
> 3. Write into A until EAGAIN; since we didn't read B, we can now be
> confident the window is zero
> 4. Set TCP_CLAMP_WINDOW on B to a fairly large
> 5. Read and discard from B until EAGAIN, we can now be confident
> the buffer is empty
> 6. Write to A: if we get EGAIN it looks like we're hitting this
> problem
I was also thinking of something similar, but if the fix involves not
setting rcv_ssthresh directly (that's my current thought), then we can
just open a TCP socket, read rcv_ssthresh via TCP_INFO, set a different
value via TCP_WINDOW_CLAMP, check that rcv_ssthresh matches that value.
--
Stefano
next prev parent reply other threads:[~2023-11-10 5:20 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-09 9:53 [PATCH 0/2] Avoid bugs related to TCP_WINDOW_CLAMP David Gibson
2023-11-09 9:53 ` [PATCH 1/2] tcp: Rename and small cleanup to tcp_clamp_window() David Gibson
2023-11-09 9:54 ` [PATCH 2/2] tcp: Don't use TCP_WINDOW_CLAMP David Gibson
2023-11-09 14:51 ` Stefano Brivio
2023-11-10 0:23 ` David Gibson
2023-11-10 5:20 ` Stefano Brivio [this message]
2023-11-10 17:06 ` [PATCH 0/2] Avoid bugs related to TCP_WINDOW_CLAMP Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231110062008.00355dee@elisabeth \
--to=sbrivio@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).