[PATCH 0/1] RFC: Clarifying seq_ack_to

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [PATCH 0/1] RFC: Clarifying seq_ack_to_tap logic
@ 2025-10-03  6:30 David Gibson
  2025-10-03  6:30 ` [PATCH 1/1] tcp: Clarify logic calculating how much guest data to ack David Gibson
  0 siblings, 1 reply; 4+ messages in thread
From: David Gibson @ 2025-10-03  6:30 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Prompted by the discussion on Stefano's recent TCP fixes, here's
change I think clarifies our logic.  This goes on top of my linter
fixes (specifically the cppcheck suppression in the vicinity) but
before Stefano's TCP fixes.

Happy to have it folded into those TCP fixes, or reorganised around
them.

Stefano, please review my large comment.  I'm confident on some of the
points in there, but not all of them.

David Gibson (1):
  tcp: Clarify logic calculating how much guest data to ack

 tcp.c | 68 ++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 26 deletions(-)

-- 
2.51.0

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/1] tcp: Clarify logic calculating how much guest data to ack
  2025-10-03  6:30 [PATCH 0/1] RFC: Clarifying seq_ack_to_tap logic David Gibson
@ 2025-10-03  6:30 ` David Gibson
  2025-10-07 22:42   ` Stefano Brivio
  0 siblings, 1 reply; 4+ messages in thread
From: David Gibson @ 2025-10-03  6:30 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

This is fairly complex, because we have a method we prefer but we need to
fall back to a simpler one in a bunch of cases.  Slightly reorganise the
code to make the flow clearer, and add a large comment giving the
rationale.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 68 ++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 26 deletions(-)

diff --git a/tcp.c b/tcp.c
index 7da41797..85eb2c32 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1014,35 +1014,51 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
 	int s = conn->sock;
 
-	if (!bytes_acked_cap) {
-		conn->seq_ack_to_tap = conn->seq_from_tap;
-		if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
-			conn->seq_ack_to_tap = prev_ack_to_tap;
-	} else {
-		if ((unsigned)SNDBUF_GET(conn) < SNDBUF_SMALL ||
-		    tcp_rtt_dst_low(conn) || CONN_IS_CLOSING(conn) ||
-		    (conn->flags & LOCAL) || force_seq) {
-			conn->seq_ack_to_tap = conn->seq_from_tap;
-		} else if (conn->seq_ack_to_tap != conn->seq_from_tap) {
-			if (!tinfo) {
-				tinfo = &tinfo_new;
-				if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
-					return 0;
-			}
-
-			/* This trips a cppcheck bug in some versions, including
-			 * cppcheck 2.18.3.
-			 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
-			 */
-			/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
-			conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
-				conn->seq_init_from_tap;
-
-			if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
-				conn->seq_ack_to_tap = prev_ack_to_tap;
+	/* At this point we could ack all the data we've accepted for forwarding
+	 * (seq_from_tap).  When possible, however, we want to only ack what the
+	 * peer has acked.  This makes it appear to the guest more like a direct
+	 * connection to the peer, and may improve flow control behaviour.
+	 *
+	 * For it to be possible and worth it we need:
+	 *  - The TCP_INFO Linux extension which gives us the peer acked bytes
+	 *  - Not to be told not to (force_seq)
+	 *  - Not half-closed in the peer->guest direction
+	 *      With no data coming from the peer, we won't get further events
+	 *      which would prompt us to recheck bytes_acked.  We could poll on
+	 *      a timer, but that's more trouble than it's worth.
+	 *  - Not a host local connection
+	 *      Data goes directly from socket to socket in this case, with
+	 *      nothing meaningful "in flight".
+	 *  - Large enough send buffer
+	 *      If this is small, there's not enough in flight to bother.
+	 */
+	if (bytes_acked_cap && !force_seq &&
+	    !CONN_IS_CLOSING(conn) &&
+	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
+	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
+		if (!tinfo) {
+			tinfo = &tinfo_new;
+			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
+				return 0;
 		}
+
+		/* This trips a cppcheck bug in some versions, including
+		 * cppcheck 2.18.3.
+		 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
+		 */
+		/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
+		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
+			conn->seq_init_from_tap;
+	} else {
+		/* Fall back to acking everything we have */
+		conn->seq_ack_to_tap = conn->seq_from_tap;
 	}
 
+	/* If the guest is retransmitting, don't let our ACKed sequence go
+	 * backwards */
+	if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
+		conn->seq_ack_to_tap = prev_ack_to_tap;
+
 	if (!snd_wnd_cap) {
 		tcp_get_sndbuf(conn);
 		new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW);
-- 
2.51.0


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/1] tcp: Clarify logic calculating how much guest data to ack
  2025-10-03  6:30 ` [PATCH 1/1] tcp: Clarify logic calculating how much guest data to ack David Gibson
@ 2025-10-07 22:42   ` Stefano Brivio
  2025-10-08  1:21     ` David Gibson
  0 siblings, 1 reply; 4+ messages in thread
From: Stefano Brivio @ 2025-10-07 22:42 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri,  3 Oct 2025 16:30:51 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> This is fairly complex, because we have a method we prefer but we need to
> fall back to a simpler one in a bunch of cases.  Slightly reorganise the
> code to make the flow clearer, and add a large comment giving the
> rationale.

I think this is a strict improvement on the original and I was about to
apply it regardless of my pending series with TCP fixes (it looks
completely independent to me) and a few nits I had, but then I noticed
one bit that might be substantially misleading, at the end.

So here come all my comments:

> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c | 68 ++++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 42 insertions(+), 26 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 7da41797..85eb2c32 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1014,35 +1014,51 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
>  	int s = conn->sock;
>  
> -	if (!bytes_acked_cap) {
> -		conn->seq_ack_to_tap = conn->seq_from_tap;
> -		if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> -			conn->seq_ack_to_tap = prev_ack_to_tap;
> -	} else {
> -		if ((unsigned)SNDBUF_GET(conn) < SNDBUF_SMALL ||
> -		    tcp_rtt_dst_low(conn) || CONN_IS_CLOSING(conn) ||
> -		    (conn->flags & LOCAL) || force_seq) {
> -			conn->seq_ack_to_tap = conn->seq_from_tap;
> -		} else if (conn->seq_ack_to_tap != conn->seq_from_tap) {
> -			if (!tinfo) {
> -				tinfo = &tinfo_new;
> -				if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> -					return 0;
> -			}
> -
> -			/* This trips a cppcheck bug in some versions, including
> -			 * cppcheck 2.18.3.
> -			 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> -			 */
> -			/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> -			conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> -				conn->seq_init_from_tap;
> -
> -			if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> -				conn->seq_ack_to_tap = prev_ack_to_tap;
> +	/* At this point we could ack all the data we've accepted for forwarding
> +	 * (seq_from_tap).  When possible, however, we want to only ack what the
> +	 * peer has acked.  This makes it appear to the guest more like a direct
> +	 * connection to the peer, and may improve flow control behaviour.

For consistency, as we don't use "ack" as a verb anywhere else, maybe
spell it out as "acknowledge" / "acknowledged".

> +	 *
> +	 * For it to be possible and worth it we need:
> +	 *  - The TCP_INFO Linux extension which gives us the peer acked bytes
> +	 *  - Not to be told not to (force_seq)
> +	 *  - Not half-closed in the peer->guest direction
> +	 *      With no data coming from the peer, we won't get further events
> +	 *      which would prompt us to recheck bytes_acked.  We could poll on
> +	 *      a timer, but that's more trouble than it's worth.

Strictly speaking, we could (and usually do) get further events
prompting us to check bytes_acked, in the form of segments from the
guest, but perhaps we can just leave this detail out for brevity,
unless you want to try and factor that in.

> +	 *  - Not a host local connection

The tcp_rtt_dst_low() is a trick to consider "local" also anything (VMs)
that's connected to us via veth.

It's not local from a network segment perspective, but it's local to
the machine, and the same consideration applies (somewhat surprisingly,
for veth). Same here, I guess we could leave this out for brevity.

> +	 *      Data goes directly from socket to socket in this case, with
> +	 *      nothing meaningful "in flight".
> +	 *  - Large enough send buffer
> +	 *      If this is small, there's not enough in flight to bother.
> +	 */
> +	if (bytes_acked_cap && !force_seq &&
> +	    !CONN_IS_CLOSING(conn) &&
> +	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> +	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> +		if (!tinfo) {
> +			tinfo = &tinfo_new;
> +			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> +				return 0;
>  		}
> +
> +		/* This trips a cppcheck bug in some versions, including
> +		 * cppcheck 2.18.3.
> +		 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> +		 */
> +		/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> +		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> +			conn->seq_init_from_tap;

Maybe fix the indentation while at it?

		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
				       conn->seq_init_from_tap;

> +	} else {
> +		/* Fall back to acking everything we have */

Maybe specifically refer to what we got so far,

		/* Fall back to acknowledging everything we got */

?

> +		conn->seq_ack_to_tap = conn->seq_from_tap;
>  	}
>  
> +	/* If the guest is retransmitting, don't let our ACKed sequence go
> +	 * backwards */

This is the misleading part I realised about, after I mentioned it in:

  https://archives.passt.top/passt-dev/20251007003219.3f286b1d@elisabeth/

...the reason why we risk rewinding the acknowledged sequence isn't
that the guest is retransmitting, because in that case we wouldn't have
advanced conn->seq_to_tap to begin with.

The reason is that one of those conditions for using bytes_acked you
listed above happened to be false, and now it becomes true again.

The only practical one I can think of is the array used by
tcp_rtt_dst_low() getting full at some point, but later we re-insert the
peer we're talking to in the table.

By the way, for consistency:

	/* Multi-line
	 * comment
	 */

> +	if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> +		conn->seq_ack_to_tap = prev_ack_to_tap;

The reason behind the current code structure is to skip this if we
didn't touch conn->seq_ack_to_tap at all, but the compiler will probably
figure this out by itself, and even if it doesn't, I guess it's more
efficient to do this unconditionally anyway.

> +
>  	if (!snd_wnd_cap) {
>  		tcp_get_sndbuf(conn);
>  		new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW);

-- 
Stefano


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/1] tcp: Clarify logic calculating how much guest data to ack
  2025-10-07 22:42   ` Stefano Brivio
@ 2025-10-08  1:21     ` David Gibson
  0 siblings, 0 replies; 4+ messages in thread
From: David Gibson @ 2025-10-08  1:21 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 7535 bytes --]

On Wed, Oct 08, 2025 at 12:42:12AM +0200, Stefano Brivio wrote:
> On Fri,  3 Oct 2025 16:30:51 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > This is fairly complex, because we have a method we prefer but we need to
> > fall back to a simpler one in a bunch of cases.  Slightly reorganise the
> > code to make the flow clearer, and add a large comment giving the
> > rationale.
> 
> I think this is a strict improvement on the original and I was about to
> apply it regardless of my pending series with TCP fixes (it looks
> completely independent to me) and a few nits I had, but then I noticed
> one bit that might be substantially misleading, at the end.
> 
> So here come all my comments:
> 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp.c | 68 ++++++++++++++++++++++++++++++++++++-----------------------
> >  1 file changed, 42 insertions(+), 26 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 7da41797..85eb2c32 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1014,35 +1014,51 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> >  	int s = conn->sock;
> >  
> > -	if (!bytes_acked_cap) {
> > -		conn->seq_ack_to_tap = conn->seq_from_tap;
> > -		if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > -			conn->seq_ack_to_tap = prev_ack_to_tap;
> > -	} else {
> > -		if ((unsigned)SNDBUF_GET(conn) < SNDBUF_SMALL ||
> > -		    tcp_rtt_dst_low(conn) || CONN_IS_CLOSING(conn) ||
> > -		    (conn->flags & LOCAL) || force_seq) {
> > -			conn->seq_ack_to_tap = conn->seq_from_tap;
> > -		} else if (conn->seq_ack_to_tap != conn->seq_from_tap) {
> > -			if (!tinfo) {
> > -				tinfo = &tinfo_new;
> > -				if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > -					return 0;
> > -			}
> > -
> > -			/* This trips a cppcheck bug in some versions, including
> > -			 * cppcheck 2.18.3.
> > -			 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> > -			 */
> > -			/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> > -			conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> > -				conn->seq_init_from_tap;
> > -
> > -			if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > -				conn->seq_ack_to_tap = prev_ack_to_tap;
> > +	/* At this point we could ack all the data we've accepted for forwarding
> > +	 * (seq_from_tap).  When possible, however, we want to only ack what the
> > +	 * peer has acked.  This makes it appear to the guest more like a direct
> > +	 * connection to the peer, and may improve flow control behaviour.
> 
> For consistency, as we don't use "ack" as a verb anywhere else, maybe
> spell it out as "acknowledge" / "acknowledged".

We don't in comments, but there is TAP_FIN_ACKED and bytes_acked_cap
(copied from the kernel's tcpi_bytes_acked).  Adjusted, nonetheless.

> > +	 *
> > +	 * For it to be possible and worth it we need:
> > +	 *  - The TCP_INFO Linux extension which gives us the peer acked bytes
> > +	 *  - Not to be told not to (force_seq)
> > +	 *  - Not half-closed in the peer->guest direction
> > +	 *      With no data coming from the peer, we won't get further events
> > +	 *      which would prompt us to recheck bytes_acked.  We could poll on
> > +	 *      a timer, but that's more trouble than it's worth.
> 
> Strictly speaking, we could (and usually do) get further events
> prompting us to check bytes_acked, in the form of segments from the
> guest, but perhaps we can just leave this detail out for brevity,
> unless you want to try and factor that in.

Good point.  I was thinking about the fact that we don't get events
for the fact that bytes_acked has changed of itself, but the comment
doesn't make that clear.  I've tweaked the wording.

> > +	 *  - Not a host local connection
> 
> The tcp_rtt_dst_low() is a trick to consider "local" also anything (VMs)
> that's connected to us via veth.
> 
> It's not local from a network segment perspective, but it's local to
> the machine, and the same consideration applies (somewhat surprisingly,
> for veth). Same here, I guess we could leave this out for brevity.

I've adjusted the wording in a way I think is better, but it will want
a re-review.

> 
> > +	 *      Data goes directly from socket to socket in this case, with
> > +	 *      nothing meaningful "in flight".
> > +	 *  - Large enough send buffer
> > +	 *      If this is small, there's not enough in flight to bother.
> > +	 */
> > +	if (bytes_acked_cap && !force_seq &&
> > +	    !CONN_IS_CLOSING(conn) &&
> > +	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> > +	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> > +		if (!tinfo) {
> > +			tinfo = &tinfo_new;
> > +			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > +				return 0;
> >  		}
> > +
> > +		/* This trips a cppcheck bug in some versions, including
> > +		 * cppcheck 2.18.3.
> > +		 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> > +		 */
> > +		/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> > +		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> > +			conn->seq_init_from_tap;
> 
> Maybe fix the indentation while at it?
> 
> 		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> 				       conn->seq_init_from_tap;

Ah, sure.  That detail of our style isn't known by my editor, so I
missed it.

> > +	} else {
> > +		/* Fall back to acking everything we have */
> 
> Maybe specifically refer to what we got so far,
> 
> 		/* Fall back to acknowledging everything we got */
> 
> ?

Uh, sure, done.

> > +		conn->seq_ack_to_tap = conn->seq_from_tap;
> >  	}
> >  
> > +	/* If the guest is retransmitting, don't let our ACKed sequence go
> > +	 * backwards */
> 
> This is the misleading part I realised about, after I mentioned it in:
> 
>   https://archives.passt.top/passt-dev/20251007003219.3f286b1d@elisabeth/
> 
> ...the reason why we risk rewinding the acknowledged sequence isn't
> that the guest is retransmitting, because in that case we wouldn't have
> advanced conn->seq_to_tap to begin with.

Right, I misunderstood the situation in which this logic is needed.

> The reason is that one of those conditions for using bytes_acked you
> listed above happened to be false, and now it becomes true again.

Aha!

> The only practical one I can think of is the array used by
> tcp_rtt_dst_low() getting full at some point, but later we re-insert the
> peer we're talking to in the table.

Comment updated.

> By the way, for consistency:
> 
> 	/* Multi-line
> 	 * comment
> 	 */

Done.

> > +	if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > +		conn->seq_ack_to_tap = prev_ack_to_tap;
> 
> The reason behind the current code structure is to skip this if we
> didn't touch conn->seq_ack_to_tap at all, but the compiler will probably
> figure this out by itself, and even if it doesn't, I guess it's more
> efficient to do this unconditionally anyway.

That was my thinking.  It should all be nearly-free in-register
integer arithmetic, plus a single probably-L1-hot store.  This way
makes for less conditionals to parse, primarily for the human reader
though also for the CPU.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-10-08  1:22 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-03  6:30 [PATCH 0/1] RFC: Clarifying seq_ack_to_tap logic David Gibson
2025-10-03  6:30 ` [PATCH 1/1] tcp: Clarify logic calculating how much guest data to ack David Gibson
2025-10-07 22:42   ` Stefano Brivio
2025-10-08  1:21     ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).