From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Cc: David Gibson <david@gibson.dropbear.id.au>,
Max Chernoff <git@maxchernoff.ca>,
Tyler Cloud <tcloud@redhat.com>
Subject: Re: [PATCH] tcp: Use less-than-MSS window on no queued data, or no data sent recently
Date: Sat, 13 Dec 2025 15:31:08 +0100 [thread overview]
Message-ID: <20251213153108.1b7b54e7@elisabeth> (raw)
In-Reply-To: <20251213142540.1319527-1-sbrivio@redhat.com>
On Sat, 13 Dec 2025 15:25:40 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:
> We limit the advertised window to guests and containers to the
> available length of the sending buffer, and if it's less than the MSS,
> since commit cf1925fb7b77 ("tcp: Don't limit window to less-than-MSS
> values, use zero instead"), we approximate that limit to zero.
>
> This way, we'll trigger a window update as soon as we realise that we
> can advertise a larger value, just like we do in all other cases where
> we advertise a zero-sized window.
>
> By doing that, we don't wait for the peer to send us data before we
> update the window. This matters because the guest or container might
> be trying to aggregate more data and won't send us anything at all if
> the advertised window is too small.
>
> However, this might be problematic in two situations:
>
> 1. one, reported by Tyler, where the remote (receiving) peer
> advertises a window that's smaller than what we usually get and
> very close to the MSS, causing the kernel to give us a starting
> size of the buffer that's less than the MSS we advertise to the
> guest or container.
>
> If this happens, we'll never advertise a non-zero window after
> the handshake, and the container or guest will never send us any
> data at all.
>
> With a simple 'curl https://cloudflare.com/', we get, with default
> TCP memory parameters, a 65535-byte window from the peer, and 46080
> bytes of initial sending buffer from the kernel. But we advertised
> a 65480-byte MSS, and we'll never actually receive the client
> request.
>
> This seems to be specific to Cloudflare for some reason, probably
> deriving from a particular tuning of TCP parameters on their
> servers.
>
> 2. another one, hypothesised by David, where the peer might only be
> willing to process (and acknowledge) data in batches.
>
> We might have queued outbound data which is, at the same time, not
> enough to fill one of these batches and be acknowledged and removed
> from the sending queue, but enough to make our available buffer
> smaller than the MSS, and the connection will hang.
>
> Take care of both cases by:
>
> a. not approximating the sending buffer to zero if we have no outboud
> queued data at all, because in that case we don't expect the
> available buffer to increase if we don't send any data, so there's
> no point in waiting for it to grow larger than the MSS.
>
> This fixes problem 1. above.
>
> b. also using the full sending buffer size if we haven't send data to
> the socket for a while (reported by tcpi_last_data_sent). This part
> was already suggested by David in:
>
> https://archives.passt.top/passt-dev/aTZzgtcKWLb28zrf@zatzit/
>
> and I'm now picking ten times the RTT as a somewhat arbitrary
> threshold.
>
> This is meant to take care of potential problem 2. above, but it
> also happens to fix 1.
>
> Reported-by: Tyler Cloud <tcloud@redhat.com>
> Link: https://bugs.passt.top/show_bug.cgi?id=183
And, I forgot:
Fixes: cf1925fb7b77 ("tcp: Don't limit window to less-than-MSS values, use zero instead")
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> tcp.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/tcp.c b/tcp.c
> index 81bc114..b179e39 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1211,8 +1211,21 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> * the MSS to zero, as we already have mechanisms in place to
> * force updates after the window becomes zero. This matches the
> * suggestion from RFC 813, Section 4.
> + *
> + * But don't do this if, either:
> + *
> + * - there's nothing in the outbound queue: the size of the
> + * sending buffer is limiting us, and it won't increase if we
> + * don't send data, so there's no point in waiting, or
> + *
> + * - we haven't sent data in a while (somewhat arbitrarily, ten
> + * times the RTT), as that might indicate that the receiver
> + * will only process data in batches that are large enough,
> + * but we won't send enough to fill one because we're stuck
> + * with pending data in the outbound queue
> */
> - if (limit < MSS_GET(conn))
> + if (limit < MSS_GET(conn) && sendq &&
> + tinfo->tcpi_last_data_sent < tinfo->tcpi_rtt / 1000 * 10)
> limit = 0;
>
> new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
--
Stefano
next prev parent reply other threads:[~2025-12-13 14:31 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-13 14:25 Stefano Brivio
2025-12-13 14:31 ` Stefano Brivio [this message]
2025-12-15 2:02 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251213153108.1b7b54e7@elisabeth \
--to=sbrivio@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=git@maxchernoff.ca \
--cc=passt-dev@passt.top \
--cc=tcloud@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).