public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH v2] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets
@ 2025-01-20 17:28 Stefano Brivio
  2025-01-21  3:01 ` David Gibson
  2025-01-21 12:42 ` David Gibson
  0 siblings, 2 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-01-20 17:28 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on
both sides"), David argues that, in general, we don't know what kind
of TCP traffic we're dealing with, on any side or path.

TCP segments might have been delivered to our socket with a PSH flag,
but we don't have a way to know about it.

Similarly, the guest might send us segments with PSH or URG set, but
we don't know if we should generally TCP_CORK sockets and uncork on
those flags, because that would assume they're running a Linux kernel
(and a particular version of it) matching the kernel that delivers
outbound packets for us.

Given that we can't make any assumption and everything might very well
be interactive traffic, disable Nagle's algorithm on all non-spliced
sockets as well.

After all, John Nagle himself is nowadays recommending that delayed
ACKs should never be enabled together with his algorithm, but we
don't have a practical way to ensure that our environment is free from
delayed ACKs (TCP_QUICKACK is not really usable for this purpose):

  https://news.ycombinator.com/item?id=34180239

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v2: Set TCP_NODELAY on inbound socket after accept4(), not on the
    listening sockets, and change failure message to debug()
    instead of trace()

 tcp.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/tcp.c b/tcp.c
index a012b81..4d6a6b3 100644
--- a/tcp.c
+++ b/tcp.c
@@ -756,6 +756,19 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
 		trace("TCP: failed to set SO_SNDBUF to %i", v);
 }
 
+/**
+ * tcp_sock_set_nodelay() - Set TCP_NODELAY option (disable Nagle's algorithm)
+ * @s:		Socket, can be -1 to avoid check in the caller
+ */
+static void tcp_sock_set_nodelay(int s)
+{
+	if (s == -1)
+		return;
+
+	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &((int){ 1 }), sizeof(int)))
+		debug("TCP: failed to set TCP_NODELAY on socket %i", s);
+}
+
 /**
  * tcp_update_csum() - Calculate TCP checksum
  * @psum:	Unfolded partial checksum of the IPv4 or IPv6 pseudo-header
@@ -1285,6 +1298,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
 		return -errno;
 
 	tcp_sock_set_bufsize(c, s);
+	tcp_sock_set_nodelay(s);
 
 	return s;
 }
@@ -2058,6 +2072,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 		goto cancel;
 
 	tcp_sock_set_bufsize(c, s);
+	tcp_sock_set_nodelay(s);
 
 	/* FIXME: When listening port has a specific bound address, record that
 	 * as our address
-- 
@@ -756,6 +756,19 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
 		trace("TCP: failed to set SO_SNDBUF to %i", v);
 }
 
+/**
+ * tcp_sock_set_nodelay() - Set TCP_NODELAY option (disable Nagle's algorithm)
+ * @s:		Socket, can be -1 to avoid check in the caller
+ */
+static void tcp_sock_set_nodelay(int s)
+{
+	if (s == -1)
+		return;
+
+	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &((int){ 1 }), sizeof(int)))
+		debug("TCP: failed to set TCP_NODELAY on socket %i", s);
+}
+
 /**
  * tcp_update_csum() - Calculate TCP checksum
  * @psum:	Unfolded partial checksum of the IPv4 or IPv6 pseudo-header
@@ -1285,6 +1298,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
 		return -errno;
 
 	tcp_sock_set_bufsize(c, s);
+	tcp_sock_set_nodelay(s);
 
 	return s;
 }
@@ -2058,6 +2072,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
 		goto cancel;
 
 	tcp_sock_set_bufsize(c, s);
+	tcp_sock_set_nodelay(s);
 
 	/* FIXME: When listening port has a specific bound address, record that
 	 * as our address
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets
  2025-01-20 17:28 [PATCH v2] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets Stefano Brivio
@ 2025-01-21  3:01 ` David Gibson
  2025-01-21  8:40   ` Stefano Brivio
  2025-01-21 12:42 ` David Gibson
  1 sibling, 1 reply; 5+ messages in thread
From: David Gibson @ 2025-01-21  3:01 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3211 bytes --]

On Mon, Jan 20, 2025 at 06:28:16PM +0100, Stefano Brivio wrote:
> Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on
> both sides"), David argues that, in general, we don't know what kind
> of TCP traffic we're dealing with, on any side or path.
> 
> TCP segments might have been delivered to our socket with a PSH flag,
> but we don't have a way to know about it.
> 
> Similarly, the guest might send us segments with PSH or URG set, but
> we don't know if we should generally TCP_CORK sockets and uncork on
> those flags, because that would assume they're running a Linux kernel
> (and a particular version of it) matching the kernel that delivers
> outbound packets for us.
> 
> Given that we can't make any assumption and everything might very well
> be interactive traffic, disable Nagle's algorithm on all non-spliced
> sockets as well.
> 
> After all, John Nagle himself is nowadays recommending that delayed
> ACKs should never be enabled together with his algorithm, but we
> don't have a practical way to ensure that our environment is free from
> delayed ACKs (TCP_QUICKACK is not really usable for this purpose):
> 
>   https://news.ycombinator.com/item?id=34180239
> 
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> v2: Set TCP_NODELAY on inbound socket after accept4(), not on the
>     listening sockets, and change failure message to debug()
>     instead of trace()

Uh.. you've moved it to after accept() for incoming connections, but
it looks like you've accidentally removed the call for outgoing
connections (tcp_conn_new_sock()).

> 
>  tcp.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/tcp.c b/tcp.c
> index a012b81..4d6a6b3 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -756,6 +756,19 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
>  		trace("TCP: failed to set SO_SNDBUF to %i", v);
>  }
>  
> +/**
> + * tcp_sock_set_nodelay() - Set TCP_NODELAY option (disable Nagle's algorithm)
> + * @s:		Socket, can be -1 to avoid check in the caller
> + */
> +static void tcp_sock_set_nodelay(int s)
> +{
> +	if (s == -1)
> +		return;
> +
> +	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &((int){ 1 }), sizeof(int)))
> +		debug("TCP: failed to set TCP_NODELAY on socket %i", s);
> +}
> +
>  /**
>   * tcp_update_csum() - Calculate TCP checksum
>   * @psum:	Unfolded partial checksum of the IPv4 or IPv6 pseudo-header
> @@ -1285,6 +1298,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
>  		return -errno;
>  
>  	tcp_sock_set_bufsize(c, s);
> +	tcp_sock_set_nodelay(s);
>  
>  	return s;
>  }
> @@ -2058,6 +2072,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
>  		goto cancel;
>  
>  	tcp_sock_set_bufsize(c, s);
> +	tcp_sock_set_nodelay(s);
>  
>  	/* FIXME: When listening port has a specific bound address, record that
>  	 * as our address

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets
  2025-01-21  3:01 ` David Gibson
@ 2025-01-21  8:40   ` Stefano Brivio
  2025-01-21 12:42     ` David Gibson
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2025-01-21  8:40 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue, 21 Jan 2025 13:31:03 +1030
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Jan 20, 2025 at 06:28:16PM +0100, Stefano Brivio wrote:
> > Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on
> > both sides"), David argues that, in general, we don't know what kind
> > of TCP traffic we're dealing with, on any side or path.
> > 
> > TCP segments might have been delivered to our socket with a PSH flag,
> > but we don't have a way to know about it.
> > 
> > Similarly, the guest might send us segments with PSH or URG set, but
> > we don't know if we should generally TCP_CORK sockets and uncork on
> > those flags, because that would assume they're running a Linux kernel
> > (and a particular version of it) matching the kernel that delivers
> > outbound packets for us.
> > 
> > Given that we can't make any assumption and everything might very well
> > be interactive traffic, disable Nagle's algorithm on all non-spliced
> > sockets as well.
> > 
> > After all, John Nagle himself is nowadays recommending that delayed
> > ACKs should never be enabled together with his algorithm, but we
> > don't have a practical way to ensure that our environment is free from
> > delayed ACKs (TCP_QUICKACK is not really usable for this purpose):
> > 
> >   https://news.ycombinator.com/item?id=34180239
> > 
> > Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > v2: Set TCP_NODELAY on inbound socket after accept4(), not on the
> >     listening sockets, and change failure message to debug()
> >     instead of trace()  
> 
> Uh.. you've moved it to after accept() for incoming connections, but
> it looks like you've accidentally removed the call for outgoing
> connections (tcp_conn_new_sock()).

Uh... no? It's here:

> >  tcp.c | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index a012b81..4d6a6b3 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -756,6 +756,19 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
> >  		trace("TCP: failed to set SO_SNDBUF to %i", v);
> >  }
> >  
> > +/**
> > + * tcp_sock_set_nodelay() - Set TCP_NODELAY option (disable Nagle's algorithm)
> > + * @s:		Socket, can be -1 to avoid check in the caller
> > + */
> > +static void tcp_sock_set_nodelay(int s)
> > +{
> > +	if (s == -1)
> > +		return;
> > +
> > +	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &((int){ 1 }), sizeof(int)))
> > +		debug("TCP: failed to set TCP_NODELAY on socket %i", s);
> > +}
> > +
> >  /**
> >   * tcp_update_csum() - Calculate TCP checksum
> >   * @psum:	Unfolded partial checksum of the IPv4 or IPv6 pseudo-header
> > @@ -1285,6 +1298,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
                                     ^^^^^^^^^^^^^^^^^

> >  		return -errno;
> >  
> >  	tcp_sock_set_bufsize(c, s);
> > +	tcp_sock_set_nodelay(s);
        ^^^^^^^^^^^^^^^^^^^^^^^^

...am I missing something?

> >  
> >  	return s;
> >  }
> > @@ -2058,6 +2072,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
> >  		goto cancel;
> >  
> >  	tcp_sock_set_bufsize(c, s);
> > +	tcp_sock_set_nodelay(s);
> >  
> >  	/* FIXME: When listening port has a specific bound address, record that
> >  	 * as our address  

-- 
Stefano


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets
  2025-01-21  8:40   ` Stefano Brivio
@ 2025-01-21 12:42     ` David Gibson
  0 siblings, 0 replies; 5+ messages in thread
From: David Gibson @ 2025-01-21 12:42 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3975 bytes --]

On Tue, Jan 21, 2025 at 09:40:02AM +0100, Stefano Brivio wrote:
> On Tue, 21 Jan 2025 13:31:03 +1030
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Mon, Jan 20, 2025 at 06:28:16PM +0100, Stefano Brivio wrote:
> > > Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on
> > > both sides"), David argues that, in general, we don't know what kind
> > > of TCP traffic we're dealing with, on any side or path.
> > > 
> > > TCP segments might have been delivered to our socket with a PSH flag,
> > > but we don't have a way to know about it.
> > > 
> > > Similarly, the guest might send us segments with PSH or URG set, but
> > > we don't know if we should generally TCP_CORK sockets and uncork on
> > > those flags, because that would assume they're running a Linux kernel
> > > (and a particular version of it) matching the kernel that delivers
> > > outbound packets for us.
> > > 
> > > Given that we can't make any assumption and everything might very well
> > > be interactive traffic, disable Nagle's algorithm on all non-spliced
> > > sockets as well.
> > > 
> > > After all, John Nagle himself is nowadays recommending that delayed
> > > ACKs should never be enabled together with his algorithm, but we
> > > don't have a practical way to ensure that our environment is free from
> > > delayed ACKs (TCP_QUICKACK is not really usable for this purpose):
> > > 
> > >   https://news.ycombinator.com/item?id=34180239
> > > 
> > > Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > > v2: Set TCP_NODELAY on inbound socket after accept4(), not on the
> > >     listening sockets, and change failure message to debug()
> > >     instead of trace()  
> > 
> > Uh.. you've moved it to after accept() for incoming connections, but
> > it looks like you've accidentally removed the call for outgoing
> > connections (tcp_conn_new_sock()).
> 
> Uh... no? It's here:
> 
> > >  tcp.c | 15 +++++++++++++++
> > >  1 file changed, 15 insertions(+)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index a012b81..4d6a6b3 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -756,6 +756,19 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
> > >  		trace("TCP: failed to set SO_SNDBUF to %i", v);
> > >  }
> > >  
> > > +/**
> > > + * tcp_sock_set_nodelay() - Set TCP_NODELAY option (disable Nagle's algorithm)
> > > + * @s:		Socket, can be -1 to avoid check in the caller
> > > + */
> > > +static void tcp_sock_set_nodelay(int s)
> > > +{
> > > +	if (s == -1)
> > > +		return;
> > > +
> > > +	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &((int){ 1 }), sizeof(int)))
> > > +		debug("TCP: failed to set TCP_NODELAY on socket %i", s);
> > > +}
> > > +
> > >  /**
> > >   * tcp_update_csum() - Calculate TCP checksum
> > >   * @psum:	Unfolded partial checksum of the IPv4 or IPv6 pseudo-header
> > > @@ -1285,6 +1298,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
>                                      ^^^^^^^^^^^^^^^^^
> 
> > >  		return -errno;
> > >  
> > >  	tcp_sock_set_bufsize(c, s);
> > > +	tcp_sock_set_nodelay(s);
>         ^^^^^^^^^^^^^^^^^^^^^^^^
> 
> ...am I missing something?

Oops, no, I am.  The page break fell at an inconvenient spot which
made my eye skate over the relevant hunk, sorry.

> 
> > >  
> > >  	return s;
> > >  }
> > > @@ -2058,6 +2072,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
> > >  		goto cancel;
> > >  
> > >  	tcp_sock_set_bufsize(c, s);
> > > +	tcp_sock_set_nodelay(s);
> > >  
> > >  	/* FIXME: When listening port has a specific bound address, record that
> > >  	 * as our address  
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets
  2025-01-20 17:28 [PATCH v2] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets Stefano Brivio
  2025-01-21  3:01 ` David Gibson
@ 2025-01-21 12:42 ` David Gibson
  1 sibling, 0 replies; 5+ messages in thread
From: David Gibson @ 2025-01-21 12:42 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3097 bytes --]

On Mon, Jan 20, 2025 at 06:28:16PM +0100, Stefano Brivio wrote:
> Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on
> both sides"), David argues that, in general, we don't know what kind
> of TCP traffic we're dealing with, on any side or path.
> 
> TCP segments might have been delivered to our socket with a PSH flag,
> but we don't have a way to know about it.
> 
> Similarly, the guest might send us segments with PSH or URG set, but
> we don't know if we should generally TCP_CORK sockets and uncork on
> those flags, because that would assume they're running a Linux kernel
> (and a particular version of it) matching the kernel that delivers
> outbound packets for us.
> 
> Given that we can't make any assumption and everything might very well
> be interactive traffic, disable Nagle's algorithm on all non-spliced
> sockets as well.
> 
> After all, John Nagle himself is nowadays recommending that delayed
> ACKs should never be enabled together with his algorithm, but we
> don't have a practical way to ensure that our environment is free from
> delayed ACKs (TCP_QUICKACK is not really usable for this purpose):
> 
>   https://news.ycombinator.com/item?id=34180239
> 
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
> v2: Set TCP_NODELAY on inbound socket after accept4(), not on the
>     listening sockets, and change failure message to debug()
>     instead of trace()
> 
>  tcp.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/tcp.c b/tcp.c
> index a012b81..4d6a6b3 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -756,6 +756,19 @@ static void tcp_sock_set_bufsize(const struct ctx *c, int s)
>  		trace("TCP: failed to set SO_SNDBUF to %i", v);
>  }
>  
> +/**
> + * tcp_sock_set_nodelay() - Set TCP_NODELAY option (disable Nagle's algorithm)
> + * @s:		Socket, can be -1 to avoid check in the caller
> + */
> +static void tcp_sock_set_nodelay(int s)
> +{
> +	if (s == -1)
> +		return;
> +
> +	if (setsockopt(s, SOL_TCP, TCP_NODELAY, &((int){ 1 }), sizeof(int)))
> +		debug("TCP: failed to set TCP_NODELAY on socket %i", s);
> +}
> +
>  /**
>   * tcp_update_csum() - Calculate TCP checksum
>   * @psum:	Unfolded partial checksum of the IPv4 or IPv6 pseudo-header
> @@ -1285,6 +1298,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
>  		return -errno;
>  
>  	tcp_sock_set_bufsize(c, s);
> +	tcp_sock_set_nodelay(s);
>  
>  	return s;
>  }
> @@ -2058,6 +2072,7 @@ void tcp_listen_handler(const struct ctx *c, union epoll_ref ref,
>  		goto cancel;
>  
>  	tcp_sock_set_bufsize(c, s);
> +	tcp_sock_set_nodelay(s);
>  
>  	/* FIXME: When listening port has a specific bound address, record that
>  	 * as our address

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-01-21 12:42 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-20 17:28 [PATCH v2] tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets Stefano Brivio
2025-01-21  3:01 ` David Gibson
2025-01-21  8:40   ` Stefano Brivio
2025-01-21 12:42     ` David Gibson
2025-01-21 12:42 ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).