* [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone
@ 2025-02-16 22:12 Stefano Brivio
2025-02-16 22:12 ` [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere Stefano Brivio
2025-02-17 3:49 ` [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone David Gibson
0 siblings, 2 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-02-16 22:12 UTC (permalink / raw)
To: passt-dev
In commit e5eefe77435a ("tcp: Refactor to use events instead of
states, split out spliced implementation"), this:
if (!bitmap_isset(rcvlowat_set, conn - ts) &&
readlen > (long)c->tcp.pipe_size / 10) {
(note the !) became:
if (conn->flags & lowat_set_flag &&
readlen > (long)c->tcp.pipe_size / 10) {
in the new tcp_splice_sock_handler().
We want to check, there, if we should set SO_RCVLOWAT, only if we
haven't set it already.
But, instead, we're checking if it's already set before we set it, so
we'll never set it, of course.
Fix the check and re-enable the functionality, which should give us
improved CPU utilisation in non-interactive cases where we are not
transferring at full pipe capacity.
Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp_splice.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 8a39a6f..5d845c9 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -556,7 +556,7 @@ eintr:
if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
continue;
- if (conn->flags & lowat_set_flag &&
+ if (!(conn->flags & lowat_set_flag) &&
readlen > (long)c->tcp.pipe_size / 10) {
int lowat = c->tcp.pipe_size / 4;
--
@@ -556,7 +556,7 @@ eintr:
if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
continue;
- if (conn->flags & lowat_set_flag &&
+ if (!(conn->flags & lowat_set_flag) &&
readlen > (long)c->tcp.pipe_size / 10) {
int lowat = c->tcp.pipe_size / 4;
--
2.43.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere
2025-02-16 22:12 [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone Stefano Brivio
@ 2025-02-16 22:12 ` Stefano Brivio
2025-02-17 3:51 ` David Gibson
2025-02-17 3:49 ` [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone David Gibson
1 sibling, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2025-02-16 22:12 UTC (permalink / raw)
To: passt-dev
If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
given flow, it means that we're blocked, waiting for the receiver to
actually receive data, with a full pipe.
In that case, if we keep EPOLLIN set for the socket on the other side
(our receiving side), we'll get into a loop such as:
41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call
41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call
41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
leading to 100% CPU usage, of course.
Drop EPOLLIN on our receiving side as long when we're waiting for
output readiness on the other side.
Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp_splice.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index f1a9223..8a39a6f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
ev[1].events = EPOLLOUT;
}
- flow_foreach_sidei(sidei)
- ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
+ flow_foreach_sidei(sidei) {
+ if (events & OUT_WAIT(sidei)) {
+ ev[sidei].events |= EPOLLOUT;
+ ev[!sidei].events &= ~EPOLLIN;
+ }
+ }
}
/**
--
@@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
ev[1].events = EPOLLOUT;
}
- flow_foreach_sidei(sidei)
- ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
+ flow_foreach_sidei(sidei) {
+ if (events & OUT_WAIT(sidei)) {
+ ev[sidei].events |= EPOLLOUT;
+ ev[!sidei].events &= ~EPOLLIN;
+ }
+ }
}
/**
--
2.43.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone
2025-02-16 22:12 [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone Stefano Brivio
2025-02-16 22:12 ` [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere Stefano Brivio
@ 2025-02-17 3:49 ` David Gibson
2025-02-17 7:12 ` Stefano Brivio
1 sibling, 1 reply; 5+ messages in thread
From: David Gibson @ 2025-02-17 3:49 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 2045 bytes --]
On Sun, Feb 16, 2025 at 11:12:15PM +0100, Stefano Brivio wrote:
> In commit e5eefe77435a ("tcp: Refactor to use events instead of
> states, split out spliced implementation"), this:
>
> if (!bitmap_isset(rcvlowat_set, conn - ts) &&
> readlen > (long)c->tcp.pipe_size / 10) {
>
> (note the !) became:
>
> if (conn->flags & lowat_set_flag &&
> readlen > (long)c->tcp.pipe_size / 10) {
>
> in the new tcp_splice_sock_handler().
>
> We want to check, there, if we should set SO_RCVLOWAT, only if we
> haven't set it already.
>
> But, instead, we're checking if it's already set before we set it, so
> we'll never set it, of course.
>
> Fix the check and re-enable the functionality, which should give us
> improved CPU utilisation in non-interactive cases where we are not
> transferring at full pipe capacity.
>
> Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Ouch.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
At least insofar as this clearly corrects towards the intended
behaviour. Given that we inadvertently bee using RCVLOWAT for so
long, I am a bit worried that this might expose deadlocks or stalls.
But, I guess we debug that when we come to it.
> ---
> tcp_splice.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 8a39a6f..5d845c9 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -556,7 +556,7 @@ eintr:
> if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
> continue;
>
> - if (conn->flags & lowat_set_flag &&
> + if (!(conn->flags & lowat_set_flag) &&
> readlen > (long)c->tcp.pipe_size / 10) {
> int lowat = c->tcp.pipe_size / 4;
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere
2025-02-16 22:12 ` [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere Stefano Brivio
@ 2025-02-17 3:51 ` David Gibson
0 siblings, 0 replies; 5+ messages in thread
From: David Gibson @ 2025-02-17 3:51 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 2410 bytes --]
On Sun, Feb 16, 2025 at 11:12:16PM +0100, Stefano Brivio wrote:
> If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a
> given flow, it means that we're blocked, waiting for the receiver to
> actually receive data, with a full pipe.
>
> In that case, if we keep EPOLLIN set for the socket on the other side
> (our receiving side), we'll get into a loop such as:
>
> 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
> 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call
> 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
> 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
> 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001)
> 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call
> 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192)
> 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577
>
> leading to 100% CPU usage, of course.
>
> Drop EPOLLIN on our receiving side as long when we're waiting for
> output readiness on the other side.
>
> Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584
> Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_splice.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index f1a9223..8a39a6f 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
> ev[1].events = EPOLLOUT;
> }
>
> - flow_foreach_sidei(sidei)
> - ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0;
> + flow_foreach_sidei(sidei) {
> + if (events & OUT_WAIT(sidei)) {
> + ev[sidei].events |= EPOLLOUT;
> + ev[!sidei].events &= ~EPOLLIN;
> + }
> + }
> }
>
> /**
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone
2025-02-17 3:49 ` [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone David Gibson
@ 2025-02-17 7:12 ` Stefano Brivio
0 siblings, 0 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-02-17 7:12 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Mon, 17 Feb 2025 14:49:39 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Sun, Feb 16, 2025 at 11:12:15PM +0100, Stefano Brivio wrote:
> > In commit e5eefe77435a ("tcp: Refactor to use events instead of
> > states, split out spliced implementation"), this:
> >
> > if (!bitmap_isset(rcvlowat_set, conn - ts) &&
> > readlen > (long)c->tcp.pipe_size / 10) {
> >
> > (note the !) became:
> >
> > if (conn->flags & lowat_set_flag &&
> > readlen > (long)c->tcp.pipe_size / 10) {
> >
> > in the new tcp_splice_sock_handler().
> >
> > We want to check, there, if we should set SO_RCVLOWAT, only if we
> > haven't set it already.
> >
> > But, instead, we're checking if it's already set before we set it, so
> > we'll never set it, of course.
> >
> > Fix the check and re-enable the functionality, which should give us
> > improved CPU utilisation in non-interactive cases where we are not
> > transferring at full pipe capacity.
> >
> > Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation")
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
>
> Ouch.
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> At least insofar as this clearly corrects towards the intended
> behaviour. Given that we inadvertently bee using RCVLOWAT for so
> long, I am a bit worried that this might expose deadlocks or stalls.
> But, I guess we debug that when we come to it.
Yeah, I was undecided as well, then I tested and tested, and I
realised that commit 904b86ade7db ("tcp: Rework window handling, timers,
add SO_RCVLOWAT and pools for sockets/pipes") added this gem, still there:
if (read >= (long)c->tcp.pipe_size * 10 / 100)
continue;
if (!bitmap_isset(rcvlowat_set, conn - ts) &&
read > (long)c->tcp.pipe_size / 10) {
int lowat = c->tcp.pipe_size / 4;
setsockopt(move_from, SOL_SOCKET, SO_RCVLOWAT,
&lowat, sizeof(lowat));
which means that we'll not set SO_RCVLOWAT anyway, because if
read > c->tcp.pipe_size / 10, we'll skip the second block, and if not,
we'll skip it anyway.
Now, I have a clear memory of characterising those 10% and 25% values
over a wide range of pipe sizes, message sizes, etc. Other than SSH and
installing packages in a container (and check that nothing gets stuck
for ~one second), my basic test idea was:
$ iperf3 -c localhost -p 5202 -l 100
with port 5202 forwarded to the network namespace and iperf3 server
listening there.
I think I added another typo while cleaning up. I probably meant:
[...]
read < (long)c->tcp.pipe_size / 10) {
...and if I do that, it finally works.
But anyway, it's too many typos at this point and especially we never
had a release with it enabled, so I'm not "fixing" this for the moment,
it needs a lot more testing than I can do now.
--
Stefano
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-02-17 7:12 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-16 22:12 [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone Stefano Brivio
2025-02-16 22:12 ` [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere Stefano Brivio
2025-02-17 3:51 ` David Gibson
2025-02-17 3:49 ` [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone David Gibson
2025-02-17 7:12 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).