public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Ed Santiago <santiago@redhat.com>,
	Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read
Date: Fri, 25 Oct 2024 11:27:55 +1100	[thread overview]
Message-ID: <ZxrmC5oUuI8ysNmt@zatzit> (raw)
In-Reply-To: <20241024141110.2951221-1-sbrivio@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 5807 bytes --]

On Thu, Oct 24, 2024 at 04:11:10PM +0200, Stefano Brivio wrote:
> In tcp_splice_sock_handler(), we try to calculate how much we can move
> from the pipe to the writing socket: if we just read some bytes, we'll
> use that amount, but if we haven't, we just try to empty the pipe.
> 
> However, if we just read something, that doesn't mean that that's all
> the data we have on the pipe, as it's obvious from this sequence, where:
> 
>   pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
>   Flow 0 (TCP connection (spliced)): 98304 from read-side call
>   Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
>   Flow 0 (TCP connection (spliced)): -1 from read-side call
>   Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
>   Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
>   Flow 0 (TCP connection (spliced)): OUT_WAIT_0
> 
> we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
> couldn't write, as the receiver buffer is full, and we set the
> corresponding OUT_WAIT flag. Then:
> 
>   pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
>   Flow 0 (TCP connection (spliced)): 32768 from read-side call
>   Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
>   Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
> 
> we splice() 32768 more bytes from our receiving side to the pipe. At
> some point:
> 
>   pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
>   Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
>   Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
>   Flow 0 (TCP connection (spliced)): 1320 from read-side call
>   Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)
> 
> the receiver is signalling to us that it's ready for more data
> (EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
> our receiving socket into the pipe, and that's what we write to the
> receiver, forgetting about the pending 97457 bytes we had, which the
> receiver might never get (not the same 97547 bytes: we'll actually
> send 1320 of those).
> 
> This condition is rather hard to reproduce, and it was observed with
> Podman pulling container images via HTTPS. In the traces above, the
> client is side 0 (the initiating peer), and the server is sending the
> whole data.
> 
> Instead of splicing from pipe to socket the amount of data we just
> read, we need to splice all the pending data we piled up until that
> point. We could do that using 'read' and 'written' counters, but
> there's actually no need, as the kernel also keeps track of how much
> data is available on the pipe.
> 
> So, to make this simple and more robust, just give the whole pipe size
> as length to splice(). The kernel knows what to do with it.
> 
> Later in the function, we used 'to_write' for an optimisation meant
> to reduce wakeups which retries right away to splice() in both
> directions if we couldn't write to the receiver the whole amount of
> pending data. Calculate a 'pending' value instead, only if we reach
> that point.
> 
> Now that we check for the actual amount of pending data in that
> optimisation, we need to make sure we don't compare a zero or negative
> 'written' value: if we met that, it means that the receiver signalled
> end-of-file, an error, or to try again later. In those three cases,
> the optimisation doesn't make any sense, so skip it.
> 
> Reported-by: Ed Santiago <santiago@redhat.com>
> Reported-by: Paul Holzinger <pholzing@redhat.com>
> Analysed-by: Paul Holzinger <pholzing@redhat.com>
> Link: https://github.com/containers/podman/issues/24219
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp_splice.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 9f5cc27..f112cfe 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -503,7 +503,7 @@ swap:
>  	lowat_act_flag = RCVLOWAT_ACT(fromsidei);
>  
>  	while (1) {
> -		ssize_t readlen, to_write = 0, written;
> +		ssize_t readlen, written, pending;
>  		int more = 0;
>  
>  retry:
> @@ -518,14 +518,11 @@ retry:
>  
>  			if (errno != EAGAIN)
>  				goto close;
> -
> -			to_write = c->tcp.pipe_size;
>  		} else if (!readlen) {
>  			eof = 1;
> -			to_write = c->tcp.pipe_size;
>  		} else {
>  			never_read = 0;
> -			to_write += readlen;
> +
>  			if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
>  				more = SPLICE_F_MORE;
>  
> @@ -535,10 +532,10 @@ retry:
>  
>  eintr:
>  		written = splice(conn->pipe[fromsidei][0], NULL,
> -				 conn->s[!fromsidei], NULL, to_write,
> +				 conn->s[!fromsidei], NULL, c->tcp.pipe_size,
>  				 SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
>  		flow_trace(conn, "%zi from write-side call (passed %zi)",
> -			   written, to_write);
> +			   written, c->tcp.pipe_size);
>  
>  		/* Most common case: skip updating counters. */
>  		if (readlen > 0 && readlen == written) {
> @@ -584,10 +581,9 @@ eintr:
>  		if (never_read && written == (long)(c->tcp.pipe_size))
>  			goto retry;
>  
> -		if (!never_read && written < to_write) {
> -			to_write -= written;
> +		pending = conn->read[fromsidei] - conn->written[fromsidei];
> +		if (!never_read && written > 0 && written < pending)
>  			goto retry;
> -		}
>  
>  		if (eof)
>  			break;

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

      reply	other threads:[~2024-10-25  0:28 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-24 14:11 [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read Stefano Brivio
2024-10-25  0:27 ` David Gibson [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZxrmC5oUuI8ysNmt@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    --cc=pholzing@redhat.com \
    --cc=santiago@redhat.com \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).