public inbox for passt-dev@passt.top
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Matej Hrica <mhrica@redhat.com>
Subject: Re: [PATCH v2 2/3] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
Date: Tue, 3 Oct 2023 13:47:13 +1100
Message-ID: <ZRuAsfoj8JGjGb5S@zatzit>
In-Reply-To: <20230929150446.2671959-3-sbrivio@redhat.com>

On Fri, Sep 29, 2023 at 05:04:45PM +0200, Stefano Brivio wrote:
> It looks like we need it as a workaround for this situation, readily
> reproducible at least with a 6.5 Linux kernel, with default rmem_max
> and wmem_max values:
> 
> - an iperf3 client on the host sends about 160 KiB, typically
>   segmented into five frames by passt. We read this data using
>   MSG_PEEK
> 
> - the iperf3 server on the guest starts receiving
> 
> - meanwhile, the host kernel advertised a zero-sized window to the
>   receiver, as expected
> 
> - eventually, the guest acknowledges all the data sent so far, and
>   we drop it from the buffer, courtesy of tcp_sock_consume(), using
>   recv() with MSG_TRUNC
> 
> - the client, however, doesn't get an updated window value, and
>   even keepalive packets are answered with zero-window segments,
>   until the connection is closed
> 
> It looks like dropping data from a socket using MSG_TRUNC doesn't
> cause a recalculation of the window, which would be expected as a
> result of any receiving operation that invalidates data on a buffer
> (that is, not with MSG_PEEK).
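
For readers following along, the two receive operations described above could be sketched roughly as below. This is only an illustration assuming Linux TCP sockets; sock_peek() and sock_discard() are made-up names, not passt functions.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: copy up to len bytes out of the socket's receive
 * queue without dequeuing them, so that they can be read again later.
 */
static ssize_t sock_peek(int s, void *buf, size_t len)
{
	return recv(s, buf, len, MSG_PEEK | MSG_DONTWAIT);
}

/* Hypothetical helper: once the peer acknowledged len bytes, dequeue them.
 * On Linux TCP sockets, MSG_TRUNC discards the data instead of copying it,
 * so no buffer is passed.
 */
static ssize_t sock_discard(int s, size_t len)
{
	return recv(s, NULL, len, MSG_TRUNC | MSG_DONTWAIT);
}
```

Both helpers return -1 with errno set on failure, as recv() does.
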
> 
> Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> the previous value we clamped to, forces a recalculation of the
> window which is advertised to the guest.
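
A minimal sketch of what "setting TCP_WINDOW_CLAMP to the previous value" could look like from user space, assuming Linux and a connected TCP socket. The helper name reapply_window_clamp() is hypothetical, and the patch itself sets the clamp from the window received from the tap rather than reading it back with getsockopt().

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: read the current TCP_WINDOW_CLAMP value and set it
 * again, unchanged. Returns 0 on success, -1 on failure with errno set.
 */
static int reapply_window_clamp(int s)
{
	int clamp;
	socklen_t sl = sizeof(clamp);

	if (getsockopt(s, IPPROTO_TCP, TCP_WINDOW_CLAMP, &clamp, &sl))
		return -1;

	return setsockopt(s, IPPROTO_TCP, TCP_WINDOW_CLAMP,
			  &clamp, sizeof(clamp));
}
```

Whether this call actually triggers a window update is the observation reported in the commit message, not something this sketch can show on its own.
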
> 
> I couldn't quite confirm this issue by following all the possible
> code paths in the kernel, yet. If confirmed, this should be fixed in
> the kernel, but meanwhile this workaround looks robust to me (and it
> will be needed for backward compatibility anyway).
> 
> Reported-by: Matej Hrica <mhrica@redhat.com>
> Link: https://bugs.passt.top/show_bug.cgi?id=74
> Analysed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp.c | 29 ++++++++++++++++++++++++-----
>  1 file changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index dff5e79..32917c8 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
>  	wnd <<= conn->ws_from_tap;
>  	wnd = MIN(MAX_WINDOW, wnd);
>  
> -	if (conn->flags & WND_CLAMPED) {
> +	/* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up
> +	 * with a zero-sized window on a TCP socket, dropping data (once
> +	 * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear
> +	 * to be enough to make the kernel advertise a non-zero window to the
> +	 * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing
> +	 * value, fixes this.
> +	 *
> +	 * The STALLED flag on a connection is a sufficient indication that we
> +	 * might have a zero-sized window on the socket, because it's set if we
> +	 * exhausted the tap-side window, or if everything we receive from a
> +	 * socket is already in flight to the guest.
> +	 *
> +	 * So, if STALLED is set, and we received a window value from the tap,
> +	 * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated
> +	 * further and fixed in the kernel instead, if confirmed.
> +	 */
> +	if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) {
>  		if (prev_scaled == wnd)
>  			return;
>  
> @@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			i = keep - 1;
>  	}
>  
> -	tcp_clamp_window(c, conn, max_ack_seq_wnd);
> -
>  	/* On socket flush failure, pretend there was no ACK, try again later */
>  	if (ack && !tcp_sock_consume(conn, max_ack_seq))
>  		tcp_update_seqack_from_tap(c, conn, max_ack_seq);
>  
> +	tcp_clamp_window(c, conn, max_ack_seq_wnd);
> +
>  	if (retr) {
>  		trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u",
>  		      max_ack_seq, conn->seq_to_tap);
> @@ -2572,8 +2588,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
>  	if (th->ack && !(conn->events & ESTABLISHED))
>  		tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
>  
> -	conn_flag(c, conn, ~STALLED);
> -
>  	/* Establishing connection from socket */
>  	if (conn->events & SOCK_ACCEPTED) {
>  		if (th->syn && th->ack && !th->fin) {
> @@ -2628,6 +2642,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr,
>  	if (count == -1)
>  		goto reset;
>  
> +	/* Note: STALLED matters for tcp_clamp_window(): unset it only after
> +	 * processing data (and window) from the tap side
> +	 */
> +	conn_flag(c, conn, ~STALLED);
> +
>  	if (conn->seq_ack_to_tap != conn->seq_from_tap)
>  		ack_due = 1;
>  
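
To spell out the new condition in tcp_clamp_window(), here is a small stand-alone sketch of the part of the check visible in the hunk above. The flag values and the function name clamp_can_skip() are placeholders for illustration, not passt's definitions, and any checks beyond the quoted context are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder flag values, for illustration only */
#define STALLED		(1U << 0)
#define WND_CLAMPED	(1U << 1)

/* Return true if the TCP_WINDOW_CLAMP setsockopt() can be skipped: the
 * connection is not stalled, a clamp was already applied, and the window
 * didn't change. With STALLED set, the clamp is always forced.
 */
static bool clamp_can_skip(uint8_t flags, uint32_t prev_scaled, uint32_t wnd)
{
	if (!(flags & STALLED) && (flags & WND_CLAMPED))
		return prev_scaled == wnd;

	return false;
}
```

This is also why the patch clears STALLED only after tcp_data_from_tap() has run: if the flag were reset first, an unchanged window would take the early return and the forced setsockopt() would never happen.
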

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
