* [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone @ 2025-02-16 22:12 Stefano Brivio 2025-02-16 22:12 ` [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere Stefano Brivio 2025-02-17 3:49 ` [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone David Gibson 0 siblings, 2 replies; 5+ messages in thread From: Stefano Brivio @ 2025-02-16 22:12 UTC (permalink / raw) To: passt-dev In commit e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation"), this: if (!bitmap_isset(rcvlowat_set, conn - ts) && readlen > (long)c->tcp.pipe_size / 10) { (note the !) became: if (conn->flags & lowat_set_flag && readlen > (long)c->tcp.pipe_size / 10) { in the new tcp_splice_sock_handler(). We want to check, there, if we should set SO_RCVLOWAT, only if we haven't set it already. But, instead, we're checking if it's already set before we set it, so we'll never set it, of course. Fix the check and re-enable the functionality, which should give us improved CPU utilisation in non-interactive cases where we are not transferring at full pipe capacity. Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- tcp_splice.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tcp_splice.c b/tcp_splice.c index 8a39a6f..5d845c9 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -556,7 +556,7 @@ eintr: if (readlen >= (long)c->tcp.pipe_size * 10 / 100) continue; - if (conn->flags & lowat_set_flag && + if (!(conn->flags & lowat_set_flag) && readlen > (long)c->tcp.pipe_size / 10) { int lowat = c->tcp.pipe_size / 4; -- @@ -556,7 +556,7 @@ eintr: if (readlen >= (long)c->tcp.pipe_size * 10 / 100) continue; - if (conn->flags & lowat_set_flag && + if (!(conn->flags & lowat_set_flag) && readlen > (long)c->tcp.pipe_size / 10) { int lowat = c->tcp.pipe_size / 4; -- 2.43.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere 2025-02-16 22:12 [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone Stefano Brivio @ 2025-02-16 22:12 ` Stefano Brivio 2025-02-17 3:51 ` David Gibson 2025-02-17 3:49 ` [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone David Gibson 1 sibling, 1 reply; 5+ messages in thread From: Stefano Brivio @ 2025-02-16 22:12 UTC (permalink / raw) To: passt-dev If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a given flow, it means that we're blocked, waiting for the receiver to actually receive data, with a full pipe. In that case, if we keep EPOLLIN set for the socket on the other side (our receiving side), we'll get into a loop such as: 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001) 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192) 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001) 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192) 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577 leading to 100% CPU usage, of course. Drop EPOLLIN on our receiving side as long when we're waiting for output readiness on the other side. Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584 Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/ Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- tcp_splice.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index f1a9223..8a39a6f 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events, ev[1].events = EPOLLOUT; } - flow_foreach_sidei(sidei) - ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0; + flow_foreach_sidei(sidei) { + if (events & OUT_WAIT(sidei)) { + ev[sidei].events |= EPOLLOUT; + ev[!sidei].events &= ~EPOLLIN; + } + } } /** -- @@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events, ev[1].events = EPOLLOUT; } - flow_foreach_sidei(sidei) - ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0; + flow_foreach_sidei(sidei) { + if (events & OUT_WAIT(sidei)) { + ev[sidei].events |= EPOLLOUT; + ev[!sidei].events &= ~EPOLLIN; + } + } } /** -- 2.43.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere 2025-02-16 22:12 ` [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere Stefano Brivio @ 2025-02-17 3:51 ` David Gibson 0 siblings, 0 replies; 5+ messages in thread From: David Gibson @ 2025-02-17 3:51 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 2410 bytes --] On Sun, Feb 16, 2025 at 11:12:16PM +0100, Stefano Brivio wrote: > If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a > given flow, it means that we're blocked, waiting for the receiver to > actually receive data, with a full pipe. > > In that case, if we keep EPOLLIN set for the socket on the other side > (our receiving side), we'll get into a loop such as: > > 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001) > 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call > 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192) > 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577 > 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001) > 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call > 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192) > 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577 > > leading to 100% CPU usage, of course. > > Drop EPOLLIN on our receiving side as long when we're waiting for > output readiness on the other side. > > Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584 > Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/ > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > --- > tcp_splice.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/tcp_splice.c b/tcp_splice.c > index f1a9223..8a39a6f 100644 > --- a/tcp_splice.c > +++ b/tcp_splice.c > @@ -131,8 +131,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events, > ev[1].events = EPOLLOUT; > } > > - flow_foreach_sidei(sidei) > - ev[sidei].events |= (events & OUT_WAIT(sidei)) ? EPOLLOUT : 0; > + flow_foreach_sidei(sidei) { > + if (events & OUT_WAIT(sidei)) { > + ev[sidei].events |= EPOLLOUT; > + ev[!sidei].events &= ~EPOLLIN; > + } > + } > } > > /** -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone 2025-02-16 22:12 [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone Stefano Brivio 2025-02-16 22:12 ` [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere Stefano Brivio @ 2025-02-17 3:49 ` David Gibson 2025-02-17 7:12 ` Stefano Brivio 1 sibling, 1 reply; 5+ messages in thread From: David Gibson @ 2025-02-17 3:49 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 2045 bytes --] On Sun, Feb 16, 2025 at 11:12:15PM +0100, Stefano Brivio wrote: > In commit e5eefe77435a ("tcp: Refactor to use events instead of > states, split out spliced implementation"), this: > > if (!bitmap_isset(rcvlowat_set, conn - ts) && > readlen > (long)c->tcp.pipe_size / 10) { > > (note the !) became: > > if (conn->flags & lowat_set_flag && > readlen > (long)c->tcp.pipe_size / 10) { > > in the new tcp_splice_sock_handler(). > > We want to check, there, if we should set SO_RCVLOWAT, only if we > haven't set it already. > > But, instead, we're checking if it's already set before we set it, so > we'll never set it, of course. > > Fix the check and re-enable the functionality, which should give us > improved CPU utilisation in non-interactive cases where we are not > transferring at full pipe capacity. > > Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Ouch. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> At least insofar as this clearly corrects towards the intended behaviour. Given that we inadvertently bee using RCVLOWAT for so long, I am a bit worried that this might expose deadlocks or stalls. But, I guess we debug that when we come to it. > --- > tcp_splice.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tcp_splice.c b/tcp_splice.c > index 8a39a6f..5d845c9 100644 > --- a/tcp_splice.c > +++ b/tcp_splice.c > @@ -556,7 +556,7 @@ eintr: > if (readlen >= (long)c->tcp.pipe_size * 10 / 100) > continue; > > - if (conn->flags & lowat_set_flag && > + if (!(conn->flags & lowat_set_flag) && > readlen > (long)c->tcp.pipe_size / 10) { > int lowat = c->tcp.pipe_size / 4; > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone 2025-02-17 3:49 ` [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone David Gibson @ 2025-02-17 7:12 ` Stefano Brivio 0 siblings, 0 replies; 5+ messages in thread From: Stefano Brivio @ 2025-02-17 7:12 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Mon, 17 Feb 2025 14:49:39 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Sun, Feb 16, 2025 at 11:12:15PM +0100, Stefano Brivio wrote: > > In commit e5eefe77435a ("tcp: Refactor to use events instead of > > states, split out spliced implementation"), this: > > > > if (!bitmap_isset(rcvlowat_set, conn - ts) && > > readlen > (long)c->tcp.pipe_size / 10) { > > > > (note the !) became: > > > > if (conn->flags & lowat_set_flag && > > readlen > (long)c->tcp.pipe_size / 10) { > > > > in the new tcp_splice_sock_handler(). > > > > We want to check, there, if we should set SO_RCVLOWAT, only if we > > haven't set it already. > > > > But, instead, we're checking if it's already set before we set it, so > > we'll never set it, of course. > > > > Fix the check and re-enable the functionality, which should give us > > improved CPU utilisation in non-interactive cases where we are not > > transferring at full pipe capacity. > > > > Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > Ouch. > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > > At least insofar as this clearly corrects towards the intended > behaviour. Given that we inadvertently bee using RCVLOWAT for so > long, I am a bit worried that this might expose deadlocks or stalls. > But, I guess we debug that when we come to it. Yeah, I was undecided as well, then I tested and tested, and I realised that commit 904b86ade7db ("tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes") added this gem, still there: if (read >= (long)c->tcp.pipe_size * 10 / 100) continue; if (!bitmap_isset(rcvlowat_set, conn - ts) && read > (long)c->tcp.pipe_size / 10) { int lowat = c->tcp.pipe_size / 4; setsockopt(move_from, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)); which means that we'll not set SO_RCVLOWAT anyway, because if read > c->tcp.pipe_size / 10, we'll skip the second block, and if not, we'll skip it anyway. Now, I have a clear memory of characterising those 10% and 25% values over a wide range of pipe sizes, message sizes, etc. Other than SSH and installing packages in a container (and check that nothing gets stuck for ~one second), my basic test idea was: $ iperf3 -c localhost -p 5202 -l 100 with port 5202 forwarded to the network namespace and iperf3 server listening there. I think I added another typo while cleaning up. I probably meant: [...] read < (long)c->tcp.pipe_size / 10) { ...and if I do that, it finally works. But anyway, it's too many typos at this point and especially we never had a release with it enabled, so I'm not "fixing" this for the moment, it needs a lot more testing than I can do now. -- Stefano ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-02-17 7:12 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-02-16 22:12 [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone Stefano Brivio 2025-02-16 22:12 ` [PATCH] tcp_splice: Don't wake up on input data if we can't write it anywhere Stefano Brivio 2025-02-17 3:51 ` David Gibson 2025-02-17 3:49 ` [PATCH] tcp_splice: A typo three years ago and SO_RCVLOWAT is gone David Gibson 2025-02-17 7:12 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox https://passt.top/passt This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for IMAP folder(s).