On Thu, Oct 24, 2024 at 04:11:10PM +0200, Stefano Brivio wrote: > In tcp_splice_sock_handler(), we try to calculate how much we can move > from the pipe to the writing socket: if we just read some bytes, we'll > use that amount, but if we haven't, we just try to empty the pipe. > > However, if we just read something, that doesn't mean that that's all > the data we have on the pipe, as it's obvious from this sequence, where: > > pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001) > Flow 0 (TCP connection (spliced)): 98304 from read-side call > Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304) > Flow 0 (TCP connection (spliced)): -1 from read-side call > Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288) > Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580 > Flow 0 (TCP connection (spliced)): OUT_WAIT_0 > > we first pile up 98304 - 33615 = 64689 pending bytes, that we read but > couldn't write, as the receiver buffer is full, and we set the > corresponding OUT_WAIT flag. Then: > > pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001) > Flow 0 (TCP connection (spliced)): 32768 from read-side call > Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768) > Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580 > > we splice() 32768 more bytes from our receiving side to the pipe. At > some point: > > pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004) > Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489 > Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0 > Flow 0 (TCP connection (spliced)): 1320 from read-side call > Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320) > > the receiver is signalling to us that it's ready for more data > (EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from > our receiving socket into the pipe, and that's what we write to the > receiver, forgetting about the pending 97457 bytes we had, which the > receiver might never get (not the same 97547 bytes: we'll actually > send 1320 of those). > > This condition is rather hard to reproduce, and it was observed with > Podman pulling container images via HTTPS. In the traces above, the > client is side 0 (the initiating peer), and the server is sending the > whole data. > > Instead of splicing from pipe to socket the amount of data we just > read, we need to splice all the pending data we piled up until that > point. We could do that using 'read' and 'written' counters, but > there's actually no need, as the kernel also keeps track of how much > data is available on the pipe. > > So, to make this simple and more robust, just give the whole pipe size > as length to splice(). The kernel knows what to do with it. > > Later in the function, we used 'to_write' for an optimisation meant > to reduce wakeups which retries right away to splice() in both > directions if we couldn't write to the receiver the whole amount of > pending data. Calculate a 'pending' value instead, only if we reach > that point. > > Now that we check for the actual amount of pending data in that > optimisation, we need to make sure we don't compare a zero or negative > 'written' value: if we met that, it means that the receiver signalled > end-of-file, an error, or to try again later. In those three cases, > the optimisation doesn't make any sense, so skip it. > > Reported-by: Ed Santiago > Reported-by: Paul Holzinger > Analysed-by: Paul Holzinger > Link: https://github.com/containers/podman/issues/24219 > Signed-off-by: Stefano Brivio Reviewed-by: David Gibson > --- > tcp_splice.c | 16 ++++++---------- > 1 file changed, 6 insertions(+), 10 deletions(-) > > diff --git a/tcp_splice.c b/tcp_splice.c > index 9f5cc27..f112cfe 100644 > --- a/tcp_splice.c > +++ b/tcp_splice.c > @@ -503,7 +503,7 @@ swap: > lowat_act_flag = RCVLOWAT_ACT(fromsidei); > > while (1) { > - ssize_t readlen, to_write = 0, written; > + ssize_t readlen, written, pending; > int more = 0; > > retry: > @@ -518,14 +518,11 @@ retry: > > if (errno != EAGAIN) > goto close; > - > - to_write = c->tcp.pipe_size; > } else if (!readlen) { > eof = 1; > - to_write = c->tcp.pipe_size; > } else { > never_read = 0; > - to_write += readlen; > + > if (readlen >= (long)c->tcp.pipe_size * 90 / 100) > more = SPLICE_F_MORE; > > @@ -535,10 +532,10 @@ retry: > > eintr: > written = splice(conn->pipe[fromsidei][0], NULL, > - conn->s[!fromsidei], NULL, to_write, > + conn->s[!fromsidei], NULL, c->tcp.pipe_size, > SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK); > flow_trace(conn, "%zi from write-side call (passed %zi)", > - written, to_write); > + written, c->tcp.pipe_size); > > /* Most common case: skip updating counters. */ > if (readlen > 0 && readlen == written) { > @@ -584,10 +581,9 @@ eintr: > if (never_read && written == (long)(c->tcp.pipe_size)) > goto retry; > > - if (!never_read && written < to_write) { > - to_write -= written; > + pending = conn->read[fromsidei] - conn->written[fromsidei]; > + if (!never_read && written > 0 && written < pending) > goto retry; > - } > > if (eof) > break; -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson