* [PATCH] tcp_splice: Set (again) TCP_NODELAY on both sides
@ 2025-01-06 9:42 Stefano Brivio
2025-01-08 3:43 ` David Gibson
0 siblings, 1 reply; 2+ messages in thread
From: Stefano Brivio @ 2025-01-06 9:42 UTC (permalink / raw)
To: passt-dev
In commit 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced
sockets") I just assumed that we wouldn't benefit from disabling
Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay).
It turns out that with some patterns, such as a PostgreSQL server
in a container receiving parameterised, short queries, for which pasta
sees several short inbound messages (Parse, Bind, Describe, Execute
and Sync commands getting each one their own packet, 5 to 49 bytes TCP
payload each), we'll read them usually in two batches, and send them
in matching batches, for example:
9165.2467: pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
9165.2468: Flow 0 (TCP connection (spliced)): 76 from read-side call
9165.2468: Flow 0 (TCP connection (spliced)): 76 from write-side call (passed 524288)
9165.2469: pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
9165.2470: Flow 0 (TCP connection (spliced)): 15 from read-side call
9165.2470: Flow 0 (TCP connection (spliced)): 15 from write-side call (passed 524288)
9165.2944: pasta: epoll event on connected spliced TCP socket 118 (events: 0x00000001)
and the kernel delivers the first one, waits for acknowledgement from
the receiver, then delivers the second one. This adds very substantial
and unnecessary delay. It's usually a fixed ~40ms between the two
batches, which is clearly unacceptable for loopback connections.
In this example, the delay is shown by the timestamp of the response
from socket 118. The peer (server) doesn't actually take that long
(less than a millisecond), but it takes that long for the kernel to
deliver our request.
To avoid batching and delays, disable Nagle's algorithm by setting
TCP_NODELAY on both internal and external sockets: this way, we get
one inbound packet for each original message, we transfer them right
away, and the kernel delivers them to the process in the container as
they are, without delay.
We can do this safely as we don't care much about network utilisation
when there's in fact pretty much no network (loopback connections).
This is unfortunately not visible in the TCP request-response tests
from the test suite because, with smaller messages (we use one byte),
Nagle's algorithm doesn't even kick in. It's probably not trivial to
implement a universal test covering this case.
Fixes: 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced sockets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp_splice.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 3a0f868..3a000ff 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -348,6 +348,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
uint8_t tgtpif = conn->f.pif[TGTSIDE];
union sockaddr_inany sa;
socklen_t sl;
+ int one = 1;
if (tgtpif == PIF_HOST)
conn->s[1] = tcp_conn_sock(c, af);
@@ -359,12 +360,21 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
if (conn->s[1] < 0)
return -1;
- if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK,
- &((int){ 1 }), sizeof(int))) {
+ if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, &one, sizeof(one))) {
flow_trace(conn, "failed to set TCP_QUICKACK on socket %i",
conn->s[1]);
}
+ if (setsockopt(conn->s[0], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
+ flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
+ conn->s[0]);
+ }
+
+ if (setsockopt(conn->s[1], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
+ flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
+ conn->s[1]);
+ }
+
pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport);
if (connect(conn->s[1], &sa.sa, sl)) {
--
@@ -348,6 +348,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
uint8_t tgtpif = conn->f.pif[TGTSIDE];
union sockaddr_inany sa;
socklen_t sl;
+ int one = 1;
if (tgtpif == PIF_HOST)
conn->s[1] = tcp_conn_sock(c, af);
@@ -359,12 +360,21 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
if (conn->s[1] < 0)
return -1;
- if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK,
- &((int){ 1 }), sizeof(int))) {
+ if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, &one, sizeof(one))) {
flow_trace(conn, "failed to set TCP_QUICKACK on socket %i",
conn->s[1]);
}
+ if (setsockopt(conn->s[0], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
+ flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
+ conn->s[0]);
+ }
+
+ if (setsockopt(conn->s[1], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
+ flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
+ conn->s[1]);
+ }
+
pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport);
if (connect(conn->s[1], &sa.sa, sl)) {
--
2.43.0
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH] tcp_splice: Set (again) TCP_NODELAY on both sides
2025-01-06 9:42 [PATCH] tcp_splice: Set (again) TCP_NODELAY on both sides Stefano Brivio
@ 2025-01-08 3:43 ` David Gibson
0 siblings, 0 replies; 2+ messages in thread
From: David Gibson @ 2025-01-08 3:43 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 5353 bytes --]
On Mon, Jan 06, 2025 at 10:42:50AM +0100, Stefano Brivio wrote:
> In commit 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced
> sockets") I just assumed that we wouldn't benefit from disabling
> Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay).
>
> It turns out that with some patterns, such as a PostgreSQL server
> in a container receiving parameterised, short queries, for which pasta
> sees several short inbound messages (Parse, Bind, Describe, Execute
> and Sync commands getting each one their own packet, 5 to 49 bytes TCP
> payload each), we'll read them usually in two batches, and send them
> in matching batches, for example:
>
> 9165.2467: pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
> 9165.2468: Flow 0 (TCP connection (spliced)): 76 from read-side call
> 9165.2468: Flow 0 (TCP connection (spliced)): 76 from write-side call (passed 524288)
> 9165.2469: pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
> 9165.2470: Flow 0 (TCP connection (spliced)): 15 from read-side call
> 9165.2470: Flow 0 (TCP connection (spliced)): 15 from write-side call (passed 524288)
> 9165.2944: pasta: epoll event on connected spliced TCP socket 118 (events: 0x00000001)
>
> and the kernel delivers the first one, waits for acknowledgement from
> the receiver, then delivers the second one. This adds very substantial
> and unnecessary delay. It's usually a fixed ~40ms between the two
> batches, which is clearly unacceptable for loopback connections.
>
> In this example, the delay is shown by the timestamp of the response
> from socket 118. The peer (server) doesn't actually take that long
> (less than a millisecond), but it takes that long for the kernel to
> deliver our request.
>
> To avoid batching and delays, disable Nagle's algorithm by setting
> TCP_NODELAY on both internal and external sockets: this way, we get
> one inbound packet for each original message, we transfer them right
> away, and the kernel delivers them to the process in the container as
> they are, without delay.
>
> We can do this safely as we don't care much about network utilisation
> when there's in fact pretty much no network (loopback connections).
It's true we don't care about network utilisation here, but I don't
think that's actually relevant. As discussed in our call, I think we
should be disabling Nagle on *every* TCP socket, "tap" ones as well.
Nagle's algorithm is (loosely) for the case where we're generating
data relatively slowly, for reasons that are independent of getting
replies from the other end. Only an endpoint has the information to
assert that, and the fact that it's enabled by default is arguably an
anachronism. As a forwarder, we don't have the right to apply Nagle's
algorithm to data passing through, and it's also unncessary: if the
stream is Nagle-suitable, the endpoint will be using it, so we'll
be receiving already-Nagled chunks of data.
Or to put it another way, Nagle is for gathering multiple write()s
into a single TCP packet, not multiple TCP packets into a combined TCP
packet.
> This is unfortunately not visible in the TCP request-response tests
> from the test suite because, with smaller messages (we use one byte),
> Nagle's algorithm doesn't even kick in. It's probably not trivial to
> implement a universal test covering this case.
>
> Fixes: 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced sockets")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In that it's correct as far as it goes.
> ---
> tcp_splice.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 3a0f868..3a000ff 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -348,6 +348,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
> uint8_t tgtpif = conn->f.pif[TGTSIDE];
> union sockaddr_inany sa;
> socklen_t sl;
> + int one = 1;
>
> if (tgtpif == PIF_HOST)
> conn->s[1] = tcp_conn_sock(c, af);
> @@ -359,12 +360,21 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
> if (conn->s[1] < 0)
> return -1;
>
> - if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK,
> - &((int){ 1 }), sizeof(int))) {
> + if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, &one, sizeof(one))) {
> flow_trace(conn, "failed to set TCP_QUICKACK on socket %i",
> conn->s[1]);
> }
>
> + if (setsockopt(conn->s[0], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
> + flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
> + conn->s[0]);
> + }
> +
> + if (setsockopt(conn->s[1], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) {
> + flow_trace(conn, "failed to set TCP_NODELAY on socket %i",
> + conn->s[1]);
> + }
> +
> pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport);
>
> if (connect(conn->s[1], &sa.sa, sl)) {
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2025-01-08 3:44 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-06 9:42 [PATCH] tcp_splice: Set (again) TCP_NODELAY on both sides Stefano Brivio
2025-01-08 3:43 ` David Gibson
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).