From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Ed Santiago <santiago@redhat.com>,
Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read
Date: Fri, 25 Oct 2024 11:27:55 +1100 [thread overview]
Message-ID: <ZxrmC5oUuI8ysNmt@zatzit> (raw)
In-Reply-To: <20241024141110.2951221-1-sbrivio@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 5807 bytes --]
On Thu, Oct 24, 2024 at 04:11:10PM +0200, Stefano Brivio wrote:
> In tcp_splice_sock_handler(), we try to calculate how much we can move
> from the pipe to the writing socket: if we just read some bytes, we'll
> use that amount, but if we haven't, we just try to empty the pipe.
>
> However, if we just read something, that doesn't mean that that's all
> the data we have on the pipe, as it's obvious from this sequence, where:
>
> pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
> Flow 0 (TCP connection (spliced)): 98304 from read-side call
> Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
> Flow 0 (TCP connection (spliced)): -1 from read-side call
> Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
> Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
> Flow 0 (TCP connection (spliced)): OUT_WAIT_0
>
> we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
> couldn't write, as the receiver buffer is full, and we set the
> corresponding OUT_WAIT flag. Then:
>
> pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
> Flow 0 (TCP connection (spliced)): 32768 from read-side call
> Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
> Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
>
> we splice() 32768 more bytes from our receiving side to the pipe. At
> some point:
>
> pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
> Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
> Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
> Flow 0 (TCP connection (spliced)): 1320 from read-side call
> Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)
>
> the receiver is signalling to us that it's ready for more data
> (EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
> our receiving socket into the pipe, and that's what we write to the
> receiver, forgetting about the pending 97457 bytes we had, which the
> receiver might never get (not the same 97547 bytes: we'll actually
> send 1320 of those).
>
> This condition is rather hard to reproduce, and it was observed with
> Podman pulling container images via HTTPS. In the traces above, the
> client is side 0 (the initiating peer), and the server is sending the
> whole data.
>
> Instead of splicing from pipe to socket the amount of data we just
> read, we need to splice all the pending data we piled up until that
> point. We could do that using 'read' and 'written' counters, but
> there's actually no need, as the kernel also keeps track of how much
> data is available on the pipe.
>
> So, to make this simple and more robust, just give the whole pipe size
> as length to splice(). The kernel knows what to do with it.
>
> Later in the function, we used 'to_write' for an optimisation meant
> to reduce wakeups which retries right away to splice() in both
> directions if we couldn't write to the receiver the whole amount of
> pending data. Calculate a 'pending' value instead, only if we reach
> that point.
>
> Now that we check for the actual amount of pending data in that
> optimisation, we need to make sure we don't compare a zero or negative
> 'written' value: if we met that, it means that the receiver signalled
> end-of-file, an error, or to try again later. In those three cases,
> the optimisation doesn't make any sense, so skip it.
>
> Reported-by: Ed Santiago <santiago@redhat.com>
> Reported-by: Paul Holzinger <pholzing@redhat.com>
> Analysed-by: Paul Holzinger <pholzing@redhat.com>
> Link: https://github.com/containers/podman/issues/24219
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_splice.c | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 9f5cc27..f112cfe 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -503,7 +503,7 @@ swap:
> lowat_act_flag = RCVLOWAT_ACT(fromsidei);
>
> while (1) {
> - ssize_t readlen, to_write = 0, written;
> + ssize_t readlen, written, pending;
> int more = 0;
>
> retry:
> @@ -518,14 +518,11 @@ retry:
>
> if (errno != EAGAIN)
> goto close;
> -
> - to_write = c->tcp.pipe_size;
> } else if (!readlen) {
> eof = 1;
> - to_write = c->tcp.pipe_size;
> } else {
> never_read = 0;
> - to_write += readlen;
> +
> if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
> more = SPLICE_F_MORE;
>
> @@ -535,10 +532,10 @@ retry:
>
> eintr:
> written = splice(conn->pipe[fromsidei][0], NULL,
> - conn->s[!fromsidei], NULL, to_write,
> + conn->s[!fromsidei], NULL, c->tcp.pipe_size,
> SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
> flow_trace(conn, "%zi from write-side call (passed %zi)",
> - written, to_write);
> + written, c->tcp.pipe_size);
>
> /* Most common case: skip updating counters. */
> if (readlen > 0 && readlen == written) {
> @@ -584,10 +581,9 @@ eintr:
> if (never_read && written == (long)(c->tcp.pipe_size))
> goto retry;
>
> - if (!never_read && written < to_write) {
> - to_write -= written;
> + pending = conn->read[fromsidei] - conn->written[fromsidei];
> + if (!never_read && written > 0 && written < pending)
> goto retry;
> - }
>
> if (eof)
> break;
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
prev parent reply other threads:[~2024-10-25 0:28 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-24 14:11 [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read Stefano Brivio
2024-10-25 0:27 ` David Gibson [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZxrmC5oUuI8ysNmt@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
--cc=pholzing@redhat.com \
--cc=santiago@redhat.com \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).