public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read
@ 2024-10-24 14:11 Stefano Brivio
  2024-10-25  0:27 ` David Gibson
  0 siblings, 1 reply; 2+ messages in thread
From: Stefano Brivio @ 2024-10-24 14:11 UTC (permalink / raw)
  To: passt-dev; +Cc: Ed Santiago, Paul Holzinger

In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.

However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 98304 from read-side call
  Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
  Flow 0 (TCP connection (spliced)): -1 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
  Flow 0 (TCP connection (spliced)): OUT_WAIT_0

we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 32768 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580

we splice() 32768 more bytes from our receiving side to the pipe. At
some point:

  pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
  Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
  Flow 0 (TCP connection (spliced)): 1320 from read-side call
  Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)

the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97547 bytes: we'll actually
send 1320 of those).

This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.

Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.

So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.

Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.

Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we met that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.

Reported-by: Ed Santiago <santiago@redhat.com>
Reported-by: Paul Holzinger <pholzing@redhat.com>
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_splice.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 9f5cc27..f112cfe 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -503,7 +503,7 @@ swap:
 	lowat_act_flag = RCVLOWAT_ACT(fromsidei);
 
 	while (1) {
-		ssize_t readlen, to_write = 0, written;
+		ssize_t readlen, written, pending;
 		int more = 0;
 
 retry:
@@ -518,14 +518,11 @@ retry:
 
 			if (errno != EAGAIN)
 				goto close;
-
-			to_write = c->tcp.pipe_size;
 		} else if (!readlen) {
 			eof = 1;
-			to_write = c->tcp.pipe_size;
 		} else {
 			never_read = 0;
-			to_write += readlen;
+
 			if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
 				more = SPLICE_F_MORE;
 
@@ -535,10 +532,10 @@ retry:
 
 eintr:
 		written = splice(conn->pipe[fromsidei][0], NULL,
-				 conn->s[!fromsidei], NULL, to_write,
+				 conn->s[!fromsidei], NULL, c->tcp.pipe_size,
 				 SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
 		flow_trace(conn, "%zi from write-side call (passed %zi)",
-			   written, to_write);
+			   written, c->tcp.pipe_size);
 
 		/* Most common case: skip updating counters. */
 		if (readlen > 0 && readlen == written) {
@@ -584,10 +581,9 @@ eintr:
 		if (never_read && written == (long)(c->tcp.pipe_size))
 			goto retry;
 
-		if (!never_read && written < to_write) {
-			to_write -= written;
+		pending = conn->read[fromsidei] - conn->written[fromsidei];
+		if (!never_read && written > 0 && written < pending)
 			goto retry;
-		}
 
 		if (eof)
 			break;
-- 
@@ -503,7 +503,7 @@ swap:
 	lowat_act_flag = RCVLOWAT_ACT(fromsidei);
 
 	while (1) {
-		ssize_t readlen, to_write = 0, written;
+		ssize_t readlen, written, pending;
 		int more = 0;
 
 retry:
@@ -518,14 +518,11 @@ retry:
 
 			if (errno != EAGAIN)
 				goto close;
-
-			to_write = c->tcp.pipe_size;
 		} else if (!readlen) {
 			eof = 1;
-			to_write = c->tcp.pipe_size;
 		} else {
 			never_read = 0;
-			to_write += readlen;
+
 			if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
 				more = SPLICE_F_MORE;
 
@@ -535,10 +532,10 @@ retry:
 
 eintr:
 		written = splice(conn->pipe[fromsidei][0], NULL,
-				 conn->s[!fromsidei], NULL, to_write,
+				 conn->s[!fromsidei], NULL, c->tcp.pipe_size,
 				 SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
 		flow_trace(conn, "%zi from write-side call (passed %zi)",
-			   written, to_write);
+			   written, c->tcp.pipe_size);
 
 		/* Most common case: skip updating counters. */
 		if (readlen > 0 && readlen == written) {
@@ -584,10 +581,9 @@ eintr:
 		if (never_read && written == (long)(c->tcp.pipe_size))
 			goto retry;
 
-		if (!never_read && written < to_write) {
-			to_write -= written;
+		pending = conn->read[fromsidei] - conn->written[fromsidei];
+		if (!never_read && written > 0 && written < pending)
 			goto retry;
-		}
 
 		if (eof)
 			break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read
  2024-10-24 14:11 [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read Stefano Brivio
@ 2024-10-25  0:27 ` David Gibson
  0 siblings, 0 replies; 2+ messages in thread
From: David Gibson @ 2024-10-25  0:27 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Ed Santiago, Paul Holzinger

[-- Attachment #1: Type: text/plain, Size: 5807 bytes --]

On Thu, Oct 24, 2024 at 04:11:10PM +0200, Stefano Brivio wrote:
> In tcp_splice_sock_handler(), we try to calculate how much we can move
> from the pipe to the writing socket: if we just read some bytes, we'll
> use that amount, but if we haven't, we just try to empty the pipe.
> 
> However, if we just read something, that doesn't mean that that's all
> the data we have on the pipe, as it's obvious from this sequence, where:
> 
>   pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
>   Flow 0 (TCP connection (spliced)): 98304 from read-side call
>   Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
>   Flow 0 (TCP connection (spliced)): -1 from read-side call
>   Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
>   Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
>   Flow 0 (TCP connection (spliced)): OUT_WAIT_0
> 
> we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
> couldn't write, as the receiver buffer is full, and we set the
> corresponding OUT_WAIT flag. Then:
> 
>   pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
>   Flow 0 (TCP connection (spliced)): 32768 from read-side call
>   Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
>   Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
> 
> we splice() 32768 more bytes from our receiving side to the pipe. At
> some point:
> 
>   pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
>   Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
>   Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
>   Flow 0 (TCP connection (spliced)): 1320 from read-side call
>   Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)
> 
> the receiver is signalling to us that it's ready for more data
> (EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
> our receiving socket into the pipe, and that's what we write to the
> receiver, forgetting about the pending 97457 bytes we had, which the
> receiver might never get (not the same 97547 bytes: we'll actually
> send 1320 of those).
> 
> This condition is rather hard to reproduce, and it was observed with
> Podman pulling container images via HTTPS. In the traces above, the
> client is side 0 (the initiating peer), and the server is sending the
> whole data.
> 
> Instead of splicing from pipe to socket the amount of data we just
> read, we need to splice all the pending data we piled up until that
> point. We could do that using 'read' and 'written' counters, but
> there's actually no need, as the kernel also keeps track of how much
> data is available on the pipe.
> 
> So, to make this simple and more robust, just give the whole pipe size
> as length to splice(). The kernel knows what to do with it.
> 
> Later in the function, we used 'to_write' for an optimisation meant
> to reduce wakeups which retries right away to splice() in both
> directions if we couldn't write to the receiver the whole amount of
> pending data. Calculate a 'pending' value instead, only if we reach
> that point.
> 
> Now that we check for the actual amount of pending data in that
> optimisation, we need to make sure we don't compare a zero or negative
> 'written' value: if we met that, it means that the receiver signalled
> end-of-file, an error, or to try again later. In those three cases,
> the optimisation doesn't make any sense, so skip it.
> 
> Reported-by: Ed Santiago <santiago@redhat.com>
> Reported-by: Paul Holzinger <pholzing@redhat.com>
> Analysed-by: Paul Holzinger <pholzing@redhat.com>
> Link: https://github.com/containers/podman/issues/24219
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp_splice.c | 16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 9f5cc27..f112cfe 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -503,7 +503,7 @@ swap:
>  	lowat_act_flag = RCVLOWAT_ACT(fromsidei);
>  
>  	while (1) {
> -		ssize_t readlen, to_write = 0, written;
> +		ssize_t readlen, written, pending;
>  		int more = 0;
>  
>  retry:
> @@ -518,14 +518,11 @@ retry:
>  
>  			if (errno != EAGAIN)
>  				goto close;
> -
> -			to_write = c->tcp.pipe_size;
>  		} else if (!readlen) {
>  			eof = 1;
> -			to_write = c->tcp.pipe_size;
>  		} else {
>  			never_read = 0;
> -			to_write += readlen;
> +
>  			if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
>  				more = SPLICE_F_MORE;
>  
> @@ -535,10 +532,10 @@ retry:
>  
>  eintr:
>  		written = splice(conn->pipe[fromsidei][0], NULL,
> -				 conn->s[!fromsidei], NULL, to_write,
> +				 conn->s[!fromsidei], NULL, c->tcp.pipe_size,
>  				 SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
>  		flow_trace(conn, "%zi from write-side call (passed %zi)",
> -			   written, to_write);
> +			   written, c->tcp.pipe_size);
>  
>  		/* Most common case: skip updating counters. */
>  		if (readlen > 0 && readlen == written) {
> @@ -584,10 +581,9 @@ eintr:
>  		if (never_read && written == (long)(c->tcp.pipe_size))
>  			goto retry;
>  
> -		if (!never_read && written < to_write) {
> -			to_write -= written;
> +		pending = conn->read[fromsidei] - conn->written[fromsidei];
> +		if (!never_read && written > 0 && written < pending)
>  			goto retry;
> -		}
>  
>  		if (eof)
>  			break;

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2024-10-25  0:28 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-24 14:11 [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read Stefano Brivio
2024-10-25  0:27 ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).