From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202412 header.b=R1b7PqTx; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 4A2CA5A004E for ; Wed, 08 Jan 2025 04:44:15 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202412; t=1736307831; bh=7F8AqjG1WZ0WBe8EFJnljyQpvOLHiRXHpBTHtBMk0ds=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=R1b7PqTxI07/sSo7TMgBQbtiud0KweK1STZ+kJZZ5giB+63UrMopBw/BoUKMvPFlX NtCdRF8Qy4sdq49mO50Zg8Q55Wpy2KgtXHnep/fBLujYe1/B3u5iWd+LG99YXWm+LK ymelYV1o6jfMFFjWbfhgu/9JniP+uXoSABVrNMn3j763TTsuImCSed5tvph1mFR6hT JnULQ1Sl6uyoKjKw9FMPv2KB+Dm1BbSFWegLpS5NeXb+9WOD9PLEEnXmV8xLKM7OM0 Ry1Y27VWSn12THr8oWgBlflZfLibaXE7r591lOkUWhtlsyjsw3bHYh6/OFpHIQ+BuD s5UPlUQi6RbeA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4YSYgC1Fl5z4wvb; Wed, 8 Jan 2025 14:43:51 +1100 (AEDT) Date: Wed, 8 Jan 2025 14:43:49 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH] tcp_splice: Set (again) TCP_NODELAY on both sides Message-ID: References: <20250106094250.3054245-1-sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="1yVPhCdH3ya2gjbk" Content-Disposition: inline In-Reply-To: <20250106094250.3054245-1-sbrivio@redhat.com> Message-ID-Hash: RUBHYYX7BYKOWC322M6IDC2UI26XTX6E X-Message-ID-Hash: RUBHYYX7BYKOWC322M6IDC2UI26XTX6E X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --1yVPhCdH3ya2gjbk Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Jan 06, 2025 at 10:42:50AM +0100, Stefano Brivio wrote: > In commit 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced > sockets") I just assumed that we wouldn't benefit from disabling > Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay). >=20 > It turns out that with some patterns, such as a PostgreSQL server > in a container receiving parameterised, short queries, for which pasta > sees several short inbound messages (Parse, Bind, Describe, Execute > and Sync commands getting each one their own packet, 5 to 49 bytes TCP > payload each), we'll read them usually in two batches, and send them > in matching batches, for example: >=20 > 9165.2467: pasta: epoll event on connected spliced TCP socket = 117 (events: 0x00000001) > 9165.2468: Flow 0 (TCP connection (spliced)): 76 from read-sid= e call > 9165.2468: Flow 0 (TCP connection (spliced)): 76 from write-si= de call (passed 524288) > 9165.2469: pasta: epoll event on connected spliced TCP socket = 117 (events: 0x00000001) > 9165.2470: Flow 0 (TCP connection (spliced)): 15 from read-sid= e call > 9165.2470: Flow 0 (TCP connection (spliced)): 15 from write-si= de call (passed 524288) > 9165.2944: pasta: epoll event on connected spliced TCP socket = 118 (events: 0x00000001) >=20 > and the kernel delivers the first one, waits for acknowledgement from > the receiver, then delivers the second one. This adds very substantial > and unnecessary delay. It's usually a fixed ~40ms between the two > batches, which is clearly unacceptable for loopback connections. >=20 > In this example, the delay is shown by the timestamp of the response > from socket 118. The peer (server) doesn't actually take that long > (less than a millisecond), but it takes that long for the kernel to > deliver our request. >=20 > To avoid batching and delays, disable Nagle's algorithm by setting > TCP_NODELAY on both internal and external sockets: this way, we get > one inbound packet for each original message, we transfer them right > away, and the kernel delivers them to the process in the container as > they are, without delay. >=20 > We can do this safely as we don't care much about network utilisation > when there's in fact pretty much no network (loopback connections). It's true we don't care about network utilisation here, but I don't think that's actually relevant. As discussed in our call, I think we should be disabling Nagle on *every* TCP socket, "tap" ones as well. Nagle's algorithm is (loosely) for the case where we're generating data relatively slowly, for reasons that are independent of getting replies from the other end. Only an endpoint has the information to assert that, and the fact that it's enabled by default is arguably an anachronism. As a forwarder, we don't have the right to apply Nagle's algorithm to data passing through, and it's also unncessary: if the stream is Nagle-suitable, the endpoint will be using it, so we'll be receiving already-Nagled chunks of data. Or to put it another way, Nagle is for gathering multiple write()s into a single TCP packet, not multiple TCP packets into a combined TCP packet. > This is unfortunately not visible in the TCP request-response tests > from the test suite because, with smaller messages (we use one byte), > Nagle's algorithm doesn't even kick in. It's probably not trivial to > implement a universal test covering this case. >=20 > Fixes: 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced sockets") > Signed-off-by: Stefano Brivio Reviewed-by: David Gibson In that it's correct as far as it goes. > --- > tcp_splice.c | 14 ++++++++++++-- > 1 file changed, 12 insertions(+), 2 deletions(-) >=20 > diff --git a/tcp_splice.c b/tcp_splice.c > index 3a0f868..3a000ff 100644 > --- a/tcp_splice.c > +++ b/tcp_splice.c > @@ -348,6 +348,7 @@ static int tcp_splice_connect(const struct ctx *c, st= ruct tcp_splice_conn *conn) > uint8_t tgtpif =3D conn->f.pif[TGTSIDE]; > union sockaddr_inany sa; > socklen_t sl; > + int one =3D 1; > =20 > if (tgtpif =3D=3D PIF_HOST) > conn->s[1] =3D tcp_conn_sock(c, af); > @@ -359,12 +360,21 @@ static int tcp_splice_connect(const struct ctx *c, = struct tcp_splice_conn *conn) > if (conn->s[1] < 0) > return -1; > =20 > - if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, > - &((int){ 1 }), sizeof(int))) { > + if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, &one, sizeof(one))) { > flow_trace(conn, "failed to set TCP_QUICKACK on socket %i", > conn->s[1]); > } > =20 > + if (setsockopt(conn->s[0], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) { > + flow_trace(conn, "failed to set TCP_NODELAY on socket %i", > + conn->s[0]); > + } > + > + if (setsockopt(conn->s[1], SOL_TCP, TCP_NODELAY, &one, sizeof(one))) { > + flow_trace(conn, "failed to set TCP_NODELAY on socket %i", > + conn->s[1]); > + } > + > pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport); > =20 > if (connect(conn->s[1], &sa.sa, sl)) { --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --1yVPhCdH3ya2gjbk Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmd99GYACgkQzQJF27ox 2GeEYw//UpdG3BG5z6YFxeD0bCPfJngg0oyjO8rqXY8r9LjjuS+Pi2s3UnihmmS9 RGJwdS33qlQiHI8cWljC4PpW2I31CDFOMeltPSpYUVrZ2USeUf5sKb95eezmCFfx MJEr67RVRXvIL54RKvPkm/J944Hx6TXeeVzt+e9pNzRK2fBfsVzjRqy98S6tg1pa 3tcZink0MbfV0wvPMFJpcYrsRP0NkOXv+QDm366Cs1bYjW0wEmXiePZgszINJrRg NUHM1EyQVQWvD+6ifkHrV090+0hEL09r4e9mggKKFgycDPCdpffce5NmlZP59rID UPFVjdbaXSxFqCEszpu63nAhSwqVLQCKOshan3ns3xHv6hUrUectxbtHWN/9IqPM /P93kGSmAScbDQCmRg6lQ+kKqMty4Y3SqJ8ndUCp3aZiycJIDHwBy328H1tzQ9BH 2xivn/Gc0XvjFRRspAo+MN/0TWv78JL8Pz76hSuvRhCrXGnpITbN3W1+DhegFKdE opykc65M8U5/iSTcZ7nX/RhfGH4019xT+Cu7vUR9iNKjwkkVye11in46CNUWxIEA heD53CtONWbJeZVVXtZlPG4/HORO3O3+EPf6Qk74wCNICNsMITfWr4kajHRrLzmv CaWEqFt1NbQxKTFelfy/Z/iPLDrg9ZXc9+E5vPzTPz7zhBw/spA= =H/Dh -----END PGP SIGNATURE----- --1yVPhCdH3ya2gjbk--