[PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues
@ 2025-10-02  0:06 Stefano Brivio
  2025-10-02  0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Stefano Brivio @ 2025-10-02  0:06 UTC (permalink / raw)
  To: passt-dev

The most important fix here is 3/4, the other ones are either fixes
for issues that weren't actually observed in practice, but I mistook
them as the case fixed by 3/4, or for minor details.

Stefano Brivio (4):
  tcp: Fix ACK sequence on FIN to tap
  tcp: Completely ignore data segment in CLOSE-WAIT state, log a message
  tcp: Don't consider FIN flags with mismatching sequence
  tcp: On partial send (incomplete sendmsg()), request a retransmission
    right away

 tcp.c     | 12 +++++++++---
 tcp_buf.c | 14 +++++++++++++-
 tcp_vu.c  |  7 ++++++-
 3 files changed, 28 insertions(+), 5 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap
  2025-10-02  0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio
@ 2025-10-02  0:06 ` Stefano Brivio
  2025-10-02  2:41   ` David Gibson
  2025-10-02  0:06 ` [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message Stefano Brivio
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-02  0:06 UTC (permalink / raw)
  To: passt-dev

If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
send a FIN segment to the guest / container acknowledging a sequence
number that's behind what we received so far, we won't have any
further trigger to send an updated ACK segment, as we are now
switching the epoll socket monitoring to edge-triggered mode.

To avoid this situation, in tcp_update_seqack_wnd(), we set the next
acknowledgement sequence to the current observed sequence, regardless
of what was acknowledged socket-side.

However, we don't necessarily call tcp_update_seqack_wnd() before
sending the FIN segment, which might potentially lead to a situation,
not observed in practice, where we unnecessarily cause a
retransmission at some point after our FIN segment.

Avoid that by setting the ACK sequence to whatever we received from
the container / guest, before sending a FIN segment and switching to
EPOLLET.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp_buf.c | 14 +++++++++++++-
 tcp_vu.c  |  7 ++++++-
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/tcp_buf.c b/tcp_buf.c
index a493b5a..cc106bc 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			conn_flag(c, conn, STALLED);
 		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
 			   SOCK_FIN_RCVD) {
-			int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
+			int ret;
+
+			/* On TAP_FIN_SENT, we won't get further data events
+			 * from the socket, and this might be the last ACK
+			 * segment we send to the tap, so update its sequence to
+			 * include everything we received until now.
+			 *
+			 * See also the special handling on CONN_IS_CLOSING() in
+			 * tcp_update_seqack_wnd().
+			 */
+			conn->seq_ack_to_tap = conn->seq_from_tap;
+
+			ret = tcp_buf_send_flag(c, conn, FIN | ACK);
 			if (ret) {
 				tcp_rst(c, conn);
 				return ret;
diff --git a/tcp_vu.c b/tcp_vu.c
index ebd3a1e..3ec3538 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 			conn_flag(c, conn, STALLED);
 		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
 			   SOCK_FIN_RCVD) {
-			int ret = tcp_vu_send_flag(c, conn, FIN | ACK);
+			int ret;
+
+			/* See tcp_buf_data_from_sock() */
+			conn->seq_ack_to_tap = conn->seq_from_tap;
+
+			ret = tcp_vu_send_flag(c, conn, FIN | ACK);
 			if (ret) {
 				tcp_rst(c, conn);
 				return ret;
-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap
  2025-10-02  0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio
@ 2025-10-02  2:41   ` David Gibson
  2025-10-02 11:58     ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-02  2:41 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3871 bytes --]

On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:
> If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
> send a FIN segment to the guest / container acknowledging a sequence
> number that's behind what we received so far, we won't have any
> further trigger to send an updated ACK segment, as we are now
> switching the epoll socket monitoring to edge-triggered mode.
> 
> To avoid this situation, in tcp_update_seqack_wnd(), we set the next
> acknowledgement sequence to the current observed sequence, regardless
> of what was acknowledged socket-side.

To double check my understanding: things should work if we always
acknowledged everything we've received.  Acknowledging only what the
peer has acked is a refinement to give the guest a view that's closer
to what it would be end-to-end with the peer (which might improve the
operation of flow control).

We can't use that refined mechanism when the socket is closing
(amongst other cases), because while we can get the peer acked bytes
from TCP_INFO, we can't get events when that changes, so we have no
mechanism to provide updates to the guest at the right time.  So we
fall back to the simpler method.

Is that correct?

> However, we don't necessarily call tcp_update_seqack_wnd() before
> sending the FIN segment, which might potentially lead to a situation,
> not observed in practice, where we unnecessarily cause a
> retransmission at some point after our FIN segment.
> 
> Avoid that by setting the ACK sequence to whatever we received from
> the container / guest, before sending a FIN segment and switching to
> EPOLLET.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Based on my understanding above, this looks correct to me, so,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

My only concern is whether we could instead insert an extra call to
tcp_update_seqack_wnd() to reduce duplicated logic.

> ---
>  tcp_buf.c | 14 +++++++++++++-
>  tcp_vu.c  |  7 ++++++-
>  2 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/tcp_buf.c b/tcp_buf.c
> index a493b5a..cc106bc 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  			conn_flag(c, conn, STALLED);
>  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
>  			   SOCK_FIN_RCVD) {
> -			int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> +			int ret;
> +
> +			/* On TAP_FIN_SENT, we won't get further data events
> +			 * from the socket, and this might be the last ACK
> +			 * segment we send to the tap, so update its sequence to
> +			 * include everything we received until now.
> +			 *
> +			 * See also the special handling on CONN_IS_CLOSING() in
> +			 * tcp_update_seqack_wnd().
> +			 */
> +			conn->seq_ack_to_tap = conn->seq_from_tap;
> +
> +			ret = tcp_buf_send_flag(c, conn, FIN | ACK);
>  			if (ret) {
>  				tcp_rst(c, conn);
>  				return ret;
> diff --git a/tcp_vu.c b/tcp_vu.c
> index ebd3a1e..3ec3538 100644
> --- a/tcp_vu.c
> +++ b/tcp_vu.c
> @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
>  			conn_flag(c, conn, STALLED);
>  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
>  			   SOCK_FIN_RCVD) {
> -			int ret = tcp_vu_send_flag(c, conn, FIN | ACK);
> +			int ret;
> +
> +			/* See tcp_buf_data_from_sock() */
> +			conn->seq_ack_to_tap = conn->seq_from_tap;
> +
> +			ret = tcp_vu_send_flag(c, conn, FIN | ACK);
>  			if (ret) {
>  				tcp_rst(c, conn);
>  				return ret;
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap
  2025-10-02  2:41   ` David Gibson
@ 2025-10-02 11:58     ` Stefano Brivio
  2025-10-03  3:19       ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-02 11:58 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 2 Oct 2025 12:41:08 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:
> > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
> > send a FIN segment to the guest / container acknowledging a sequence
> > number that's behind what we received so far, we won't have any
> > further trigger to send an updated ACK segment, as we are now
> > switching the epoll socket monitoring to edge-triggered mode.
> > 
> > To avoid this situation, in tcp_update_seqack_wnd(), we set the next
> > acknowledgement sequence to the current observed sequence, regardless
> > of what was acknowledged socket-side.  
> 
> To double check my understanding: things should work if we always
> acknowledged everything we've received.  Acknowledging only what the
> peer has acked is a refinement to give the guest a view that's closer
> to what it would be end-to-end with the peer (which might improve the
> operation of flow control).

Right.

> We can't use that refined mechanism when the socket is closing
> (amongst other cases), because while we can get the peer acked bytes
> from TCP_INFO, we can't get events when that changes, so we have no
> mechanism to provide updates to the guest at the right time.  So we
> fall back to the simpler method.
> 
> Is that correct?

Also correct, yes. If you have a better idea to summarise this in the
comment in tcp_buf_data_from_sock() let me know. Maybe I could mention
EPOLLET explicitly there?

> > However, we don't necessarily call tcp_update_seqack_wnd() before
> > sending the FIN segment, which might potentially lead to a situation,
> > not observed in practice, where we unnecessarily cause a
> > retransmission at some point after our FIN segment.
> > 
> > Avoid that by setting the ACK sequence to whatever we received from
> > the container / guest, before sending a FIN segment and switching to
> > EPOLLET.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> 
> Based on my understanding above, this looks correct to me, so,
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> My only concern is whether we could instead insert an extra call to
> tcp_update_seqack_wnd() to reduce duplicated logic.

Hmm, maybe, but on the other hand we're closing the connection. Should
we really spend time querying TCP_INFO to recalculate the window at
this point? I wouldn't.

> > ---
> >  tcp_buf.c | 14 +++++++++++++-
> >  tcp_vu.c  |  7 ++++++-
> >  2 files changed, 19 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tcp_buf.c b/tcp_buf.c
> > index a493b5a..cc106bc 100644
> > --- a/tcp_buf.c
> > +++ b/tcp_buf.c
> > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> >  			conn_flag(c, conn, STALLED);
> >  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> >  			   SOCK_FIN_RCVD) {
> > -			int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> > +			int ret;
> > +
> > +			/* On TAP_FIN_SENT, we won't get further data events
> > +			 * from the socket, and this might be the last ACK
> > +			 * segment we send to the tap, so update its sequence to
> > +			 * include everything we received until now.
> > +			 *
> > +			 * See also the special handling on CONN_IS_CLOSING() in
> > +			 * tcp_update_seqack_wnd().
> > +			 */
> > +			conn->seq_ack_to_tap = conn->seq_from_tap;
> > +
> > +			ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> >  			if (ret) {
> >  				tcp_rst(c, conn);
> >  				return ret;
> > diff --git a/tcp_vu.c b/tcp_vu.c
> > index ebd3a1e..3ec3538 100644
> > --- a/tcp_vu.c
> > +++ b/tcp_vu.c
> > @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> >  			conn_flag(c, conn, STALLED);
> >  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> >  			   SOCK_FIN_RCVD) {
> > -			int ret = tcp_vu_send_flag(c, conn, FIN | ACK);
> > +			int ret;
> > +
> > +			/* See tcp_buf_data_from_sock() */
> > +			conn->seq_ack_to_tap = conn->seq_from_tap;
> > +
> > +			ret = tcp_vu_send_flag(c, conn, FIN | ACK);
> >  			if (ret) {
> >  				tcp_rst(c, conn);
> >  				return ret;
> > -- 
> > 2.43.0
> >   

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap
  2025-10-02 11:58     ` Stefano Brivio
@ 2025-10-03  3:19       ` David Gibson
  2025-10-06 22:32         ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-03  3:19 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 5560 bytes --]

On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote:
> On Thu, 2 Oct 2025 12:41:08 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:
> > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
> > > send a FIN segment to the guest / container acknowledging a sequence
> > > number that's behind what we received so far, we won't have any
> > > further trigger to send an updated ACK segment, as we are now
> > > switching the epoll socket monitoring to edge-triggered mode.
> > > 
> > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next
> > > acknowledgement sequence to the current observed sequence, regardless
> > > of what was acknowledged socket-side.  
> > 
> > To double check my understanding: things should work if we always
> > acknowledged everything we've received.  Acknowledging only what the
> > peer has acked is a refinement to give the guest a view that's closer
> > to what it would be end-to-end with the peer (which might improve the
> > operation of flow control).
> 
> Right.
> 
> > We can't use that refined mechanism when the socket is closing
> > (amongst other cases), because while we can get the peer acked bytes
> > from TCP_INFO, we can't get events when that changes, so we have no
> > mechanism to provide updates to the guest at the right time.  So we
> > fall back to the simpler method.
> > 
> > Is that correct?
> 
> Also correct, yes. If you have a better idea to summarise this in the
> comment in tcp_buf_data_from_sock() let me know.

Hm, I might.  Or actually a way to reorganise the code that I think
will be a bit clearer and probably allow a clearer comment too.

> Maybe I could mention
> EPOLLET explicitly there?

I don't think EPOLLET is actually relevant.  Even if we had level
triggered events, a change in bytes_acked doesn't count as an event
(AFAIK).  So either some other event is on, in which case we'd
effectively be busy polling bytes_acked, or it's not in which case we
don't get updates, just like now.

I principle we could implement some sort of timer based polling, but
that sounds like way more trouble than it's worth.

> > > However, we don't necessarily call tcp_update_seqack_wnd() before
> > > sending the FIN segment, which might potentially lead to a situation,
> > > not observed in practice, where we unnecessarily cause a
> > > retransmission at some point after our FIN segment.
> > > 
> > > Avoid that by setting the ACK sequence to whatever we received from
> > > the container / guest, before sending a FIN segment and switching to
> > > EPOLLET.
> > > 
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> > 
> > Based on my understanding above, this looks correct to me, so,
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > My only concern is whether we could instead insert an extra call to
> > tcp_update_seqack_wnd() to reduce duplicated logic.
> 
> Hmm, maybe, but on the other hand we're closing the connection. Should
> we really spend time querying TCP_INFO to recalculate the window at
> this point? I wouldn't.

Good point.  I mean tcp_update_seqack_wnd() could skip the TCP_INFO in
that case, but that does look a bit fiddly.

On the other hand, in favour of not duplicating logic...

[snip]
> > > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > >  			conn_flag(c, conn, STALLED);
> > >  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> > >  			   SOCK_FIN_RCVD) {
> > > -			int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> > > +			int ret;
> > > +
> > > +			/* On TAP_FIN_SENT, we won't get further data events
> > > +			 * from the socket, and this might be the last ACK
> > > +			 * segment we send to the tap, so update its sequence to
> > > +			 * include everything we received until now.
> > > +			 *
> > > +			 * See also the special handling on CONN_IS_CLOSING() in
> > > +			 * tcp_update_seqack_wnd().
> > > +			 */
> > > +			conn->seq_ack_to_tap = conn->seq_from_tap;

... the equivalent bits in tcp_update_seqack_wnd() have after them:
	if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
		conn->seq_ack_to_tap = prev_ack_to_tap;

Don't we need that here as well, in case the guest is retransmitting
when we get the sock side FIN?

> > > +
> > > +			ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> > >  			if (ret) {
> > >  				tcp_rst(c, conn);
> > >  				return ret;
> > > diff --git a/tcp_vu.c b/tcp_vu.c
> > > index ebd3a1e..3ec3538 100644
> > > --- a/tcp_vu.c
> > > +++ b/tcp_vu.c
> > > @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > >  			conn_flag(c, conn, STALLED);
> > >  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> > >  			   SOCK_FIN_RCVD) {
> > > -			int ret = tcp_vu_send_flag(c, conn, FIN | ACK);
> > > +			int ret;
> > > +
> > > +			/* See tcp_buf_data_from_sock() */
> > > +			conn->seq_ack_to_tap = conn->seq_from_tap;
> > > +
> > > +			ret = tcp_vu_send_flag(c, conn, FIN | ACK);
> > >  			if (ret) {
> > >  				tcp_rst(c, conn);
> > >  				return ret;
> > > -- 
> > > 2.43.0
> > >   
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap
  2025-10-03  3:19       ` David Gibson
@ 2025-10-06 22:32         ` Stefano Brivio
  2025-10-06 23:31           ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-06 22:32 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 3 Oct 2025 13:19:17 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote:
> > On Thu, 2 Oct 2025 12:41:08 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:  
> > > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
> > > > send a FIN segment to the guest / container acknowledging a sequence
> > > > number that's behind what we received so far, we won't have any
> > > > further trigger to send an updated ACK segment, as we are now
> > > > switching the epoll socket monitoring to edge-triggered mode.
> > > > 
> > > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next
> > > > acknowledgement sequence to the current observed sequence, regardless
> > > > of what was acknowledged socket-side.    
> > > 
> > > To double check my understanding: things should work if we always
> > > acknowledged everything we've received.  Acknowledging only what the
> > > peer has acked is a refinement to give the guest a view that's closer
> > > to what it would be end-to-end with the peer (which might improve the
> > > operation of flow control).  
> > 
> > Right.
> >   
> > > We can't use that refined mechanism when the socket is closing
> > > (amongst other cases), because while we can get the peer acked bytes
> > > from TCP_INFO, we can't get events when that changes, so we have no
> > > mechanism to provide updates to the guest at the right time.  So we
> > > fall back to the simpler method.
> > > 
> > > Is that correct?  
> > 
> > Also correct, yes. If you have a better idea to summarise this in the
> > comment in tcp_buf_data_from_sock() let me know.  
> 
> Hm, I might.  Or actually a way to reorganise the code that I think
> will be a bit clearer and probably allow a clearer comment too.

I would keep reworks for a later moment. Right now, it's already taking
me long enough to find a moment to investigate these issues, write these
fixes, and test them.

> > Maybe I could mention
> > EPOLLET explicitly there?  
> 
> I don't think EPOLLET is actually relevant.  Even if we had level
> triggered events, a change in bytes_acked doesn't count as an event
> (AFAIK).

It doesn't count, but with level-triggered events, we would be busy
polling bytes_acked, as you noted. I was mentioning EPOLLET because it
could be taken, intuitively, as a "stop listening for events" (almost)
step. I'll leave that out then.

> So either some other event is on, in which case we'd
> effectively be busy polling bytes_acked, or it's not in which case we
> don't get updates, just like now.
> 
> I principle we could implement some sort of timer based polling, but
> that sounds like way more trouble than it's worth.

We already have something similar, based on ACK_INTERVAL and
ACK_TO_TAP_DUE, and it shouldn't be overly complicated to extend that
to a new FIN_TO_TAP_DUE flag. But indeed beyond the scope of this
series.

> > > > However, we don't necessarily call tcp_update_seqack_wnd() before
> > > > sending the FIN segment, which might potentially lead to a situation,
> > > > not observed in practice, where we unnecessarily cause a
> > > > retransmission at some point after our FIN segment.
> > > > 
> > > > Avoid that by setting the ACK sequence to whatever we received from
> > > > the container / guest, before sending a FIN segment and switching to
> > > > EPOLLET.
> > > > 
> > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>    
> > > 
> > > Based on my understanding above, this looks correct to me, so,
> > > 
> > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > 
> > > My only concern is whether we could instead insert an extra call to
> > > tcp_update_seqack_wnd() to reduce duplicated logic.  
> > 
> > Hmm, maybe, but on the other hand we're closing the connection. Should
> > we really spend time querying TCP_INFO to recalculate the window at
> > this point? I wouldn't.  
> 
> Good point.  I mean tcp_update_seqack_wnd() could skip the TCP_INFO in
> that case, but that does look a bit fiddly.
> 
> On the other hand, in favour of not duplicating logic...
> 
> [snip]
> > > > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > > >  			conn_flag(c, conn, STALLED);
> > > >  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> > > >  			   SOCK_FIN_RCVD) {
> > > > -			int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> > > > +			int ret;
> > > > +
> > > > +			/* On TAP_FIN_SENT, we won't get further data events
> > > > +			 * from the socket, and this might be the last ACK
> > > > +			 * segment we send to the tap, so update its sequence to
> > > > +			 * include everything we received until now.
> > > > +			 *
> > > > +			 * See also the special handling on CONN_IS_CLOSING() in
> > > > +			 * tcp_update_seqack_wnd().
> > > > +			 */
> > > > +			conn->seq_ack_to_tap = conn->seq_from_tap;  
> 
> ... the equivalent bits in tcp_update_seqack_wnd() have after them:
> 	if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> 		conn->seq_ack_to_tap = prev_ack_to_tap;
> 
> Don't we need that here as well, in case the guest is retransmitting
> when we get the sock side FIN?

Not really, because we don't rewind conn->seq_from_tap, so we don't
risk jumping back here.

In tcp_update_seqack_wnd(), we might jump back (that should be double
checked eventually, I'm not sure it's still the case) if we happened to
acknowledge more than acknowledged socket-side while handling some
particular condition, and then we switch back to acknowledging only
bytes_acked.

It should happen if the destination is/was a low RTT one, but we run
out of slots in low_rtt_dst in favour of other entries. I don't
remember any other case.

> > > > +
> > > > +			ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> > > >  			if (ret) {
> > > >  				tcp_rst(c, conn);
> > > >  				return ret;
> > > > diff --git a/tcp_vu.c b/tcp_vu.c
> > > > index ebd3a1e..3ec3538 100644
> > > > --- a/tcp_vu.c
> > > > +++ b/tcp_vu.c
> > > > @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > > >  			conn_flag(c, conn, STALLED);
> > > >  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> > > >  			   SOCK_FIN_RCVD) {
> > > > -			int ret = tcp_vu_send_flag(c, conn, FIN | ACK);
> > > > +			int ret;
> > > > +
> > > > +			/* See tcp_buf_data_from_sock() */
> > > > +			conn->seq_ack_to_tap = conn->seq_from_tap;
> > > > +
> > > > +			ret = tcp_vu_send_flag(c, conn, FIN | ACK);
> > > >  			if (ret) {
> > > >  				tcp_rst(c, conn);
> > > >  				return ret;
> > > > -- 
> > > > 2.43.0
> > > >     

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap
  2025-10-06 22:32         ` Stefano Brivio
@ 2025-10-06 23:31           ` David Gibson
  2025-10-07 22:42             ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-06 23:31 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 7017 bytes --]

On Tue, Oct 07, 2025 at 12:32:19AM +0200, Stefano Brivio wrote:
> On Fri, 3 Oct 2025 13:19:17 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote:
> > > On Thu, 2 Oct 2025 12:41:08 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:  
> > > > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
> > > > > send a FIN segment to the guest / container acknowledging a sequence
> > > > > number that's behind what we received so far, we won't have any
> > > > > further trigger to send an updated ACK segment, as we are now
> > > > > switching the epoll socket monitoring to edge-triggered mode.
> > > > > 
> > > > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next
> > > > > acknowledgement sequence to the current observed sequence, regardless
> > > > > of what was acknowledged socket-side.    
> > > > 
> > > > To double check my understanding: things should work if we always
> > > > acknowledged everything we've received.  Acknowledging only what the
> > > > peer has acked is a refinement to give the guest a view that's closer
> > > > to what it would be end-to-end with the peer (which might improve the
> > > > operation of flow control).  
> > > 
> > > Right.
> > >   
> > > > We can't use that refined mechanism when the socket is closing
> > > > (amongst other cases), because while we can get the peer acked bytes
> > > > from TCP_INFO, we can't get events when that changes, so we have no
> > > > mechanism to provide updates to the guest at the right time.  So we
> > > > fall back to the simpler method.
> > > > 
> > > > Is that correct?  
> > > 
> > > Also correct, yes. If you have a better idea to summarise this in the
> > > comment in tcp_buf_data_from_sock() let me know.  
> > 
> > Hm, I might.  Or actually a way to reorganise the code that I think
> > will be a bit clearer and probably allow a clearer comment too.
> 
> I would keep reworks for a later moment. Right now, it's already taking
> me long enough to find a moment to investigate these issues, write these
> fixes, and test them.

I mean... the change I'm proposing reduces lines of code (excepting
the big new comment), makes it easier to reason about and is localised
to the immediately surrounding code.  But whatever, I don't
particularly care about the order we do things.

> > > Maybe I could mention
> > > EPOLLET explicitly there?  
> > 
> > I don't think EPOLLET is actually relevant.  Even if we had level
> > triggered events, a change in bytes_acked doesn't count as an event
> > (AFAIK).
> 
> It doesn't count, but with level-triggered events, we would be busy
> polling bytes_acked, as you noted. I was mentioning EPOLLET because it
> could be taken, intuitively, as a "stop listening for events" (almost)
> step. I'll leave that out then.

I mean, if busy polling were acceptable we could accomplish that
easily enough by doing it in tcp_defer_handler() regardless of
EPOLLET.

> > So either some other event is on, in which case we'd
> > effectively be busy polling bytes_acked, or it's not in which case we
> > don't get updates, just like now.
> > 
> > I principle we could implement some sort of timer based polling, but
> > that sounds like way more trouble than it's worth.
> 
> We already have something similar, based on ACK_INTERVAL and
> ACK_TO_TAP_DUE, and it shouldn't be overly complicated to extend that
> to a new FIN_TO_TAP_DUE flag. But indeed beyond the scope of this
> series.

Certainly.

> > > > > However, we don't necessarily call tcp_update_seqack_wnd() before
> > > > > sending the FIN segment, which might potentially lead to a situation,
> > > > > not observed in practice, where we unnecessarily cause a
> > > > > retransmission at some point after our FIN segment.
> > > > > 
> > > > > Avoid that by setting the ACK sequence to whatever we received from
> > > > > the container / guest, before sending a FIN segment and switching to
> > > > > EPOLLET.
> > > > > 
> > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>    
> > > > 
> > > > Based on my understanding above, this looks correct to me, so,
> > > > 
> > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > 
> > > > My only concern is whether we could instead insert an extra call to
> > > > tcp_update_seqack_wnd() to reduce duplicated logic.  
> > > 
> > > Hmm, maybe, but on the other hand we're closing the connection. Should
> > > we really spend time querying TCP_INFO to recalculate the window at
> > > this point? I wouldn't.  
> > 
> > Good point.  I mean tcp_update_seqack_wnd() could skip the TCP_INFO in
> > that case, but that does look a bit fiddly.
> > 
> > On the other hand, in favour of not duplicating logic...
> > 
> > [snip]
> > > > > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > >  			conn_flag(c, conn, STALLED);
> > > > >  		} else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
> > > > >  			   SOCK_FIN_RCVD) {
> > > > > -			int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
> > > > > +			int ret;
> > > > > +
> > > > > +			/* On TAP_FIN_SENT, we won't get further data events
> > > > > +			 * from the socket, and this might be the last ACK
> > > > > +			 * segment we send to the tap, so update its sequence to
> > > > > +			 * include everything we received until now.
> > > > > +			 *
> > > > > +			 * See also the special handling on CONN_IS_CLOSING() in
> > > > > +			 * tcp_update_seqack_wnd().
> > > > > +			 */
> > > > > +			conn->seq_ack_to_tap = conn->seq_from_tap;  
> > 
> > ... the equivalent bits in tcp_update_seqack_wnd() have after them:
> > 	if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > 		conn->seq_ack_to_tap = prev_ack_to_tap;
> > 
> > Don't we need that here as well, in case the guest is retransmitting
> > when we get the sock side FIN?
> 
> Not really, because we don't rewind conn->seq_from_tap, so we don't
> risk jumping back here.

Ah, because this one's on the sock->tap data path, whereas the other
calls are on the tap->sock data path.  Good point.

> In tcp_update_seqack_wnd(), we might jump back (that should be double
> checked eventually, I'm not sure it's still the case) if we happened to
> acknowledge more than acknowledged socket-side while handling some
> particular condition, and then we switch back to acknowledging only
> bytes_acked.
> 
> It should happen if the destination is/was a low RTT one, but we run
> out of slots in low_rtt_dst in favour of other entries. I don't
> remember any other case.


-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap
  2025-10-06 23:31           ` David Gibson
@ 2025-10-07 22:42             ` Stefano Brivio
  2025-10-08  0:42               ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-07 22:42 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue, 7 Oct 2025 10:31:11 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Oct 07, 2025 at 12:32:19AM +0200, Stefano Brivio wrote:
> > On Fri, 3 Oct 2025 13:19:17 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote:  
> > > > On Thu, 2 Oct 2025 12:41:08 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:    
> > > > > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
> > > > > > send a FIN segment to the guest / container acknowledging a sequence
> > > > > > number that's behind what we received so far, we won't have any
> > > > > > further trigger to send an updated ACK segment, as we are now
> > > > > > switching the epoll socket monitoring to edge-triggered mode.
> > > > > > 
> > > > > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next
> > > > > > acknowledgement sequence to the current observed sequence, regardless
> > > > > > of what was acknowledged socket-side.      
> > > > > 
> > > > > To double check my understanding: things should work if we always
> > > > > acknowledged everything we've received.  Acknowledging only what the
> > > > > peer has acked is a refinement to give the guest a view that's closer
> > > > > to what it would be end-to-end with the peer (which might improve the
> > > > > operation of flow control).    
> > > > 
> > > > Right.
> > > >     
> > > > > We can't use that refined mechanism when the socket is closing
> > > > > (amongst other cases), because while we can get the peer acked bytes
> > > > > from TCP_INFO, we can't get events when that changes, so we have no
> > > > > mechanism to provide updates to the guest at the right time.  So we
> > > > > fall back to the simpler method.
> > > > > 
> > > > > Is that correct?    
> > > > 
> > > > Also correct, yes. If you have a better idea to summarise this in the
> > > > comment in tcp_buf_data_from_sock() let me know.    
> > > 
> > > Hm, I might.  Or actually a way to reorganise the code that I think
> > > will be a bit clearer and probably allow a clearer comment too.  
> > 
> > I would keep reworks for a later moment. Right now, it's already taking
> > me long enough to find a moment to investigate these issues, write these
> > fixes, and test them.  
> 
> I mean... the change I'm proposing reduces lines of code (excepting
> the big new comment), makes it easier to reason about and is localised
> to the immediately surrounding code.  But whatever, I don't
> particularly care about the order we do things.

Sure, I don't think that other patch is particularly complicated or
might ever become problematic at all, but, from your earlier comment
("reorganise the code", as I mentioned the comment in
tcp_buf_data_from_sock()), I thought you wanted to rework this
particular code in tcp_buf_data_from_sock() at the same time.

I guess it's not the case judging from your most recent reply, though.

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap
  2025-10-07 22:42             ` Stefano Brivio
@ 2025-10-08  0:42               ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-10-08  0:42 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3605 bytes --]

On Wed, Oct 08, 2025 at 12:42:37AM +0200, Stefano Brivio wrote:
> On Tue, 7 Oct 2025 10:31:11 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Oct 07, 2025 at 12:32:19AM +0200, Stefano Brivio wrote:
> > > On Fri, 3 Oct 2025 13:19:17 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote:  
> > > > > On Thu, 2 Oct 2025 12:41:08 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:    
> > > > > > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
> > > > > > > send a FIN segment to the guest / container acknowledging a sequence
> > > > > > > number that's behind what we received so far, we won't have any
> > > > > > > further trigger to send an updated ACK segment, as we are now
> > > > > > > switching the epoll socket monitoring to edge-triggered mode.
> > > > > > > 
> > > > > > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next
> > > > > > > acknowledgement sequence to the current observed sequence, regardless
> > > > > > > of what was acknowledged socket-side.      
> > > > > > 
> > > > > > To double check my understanding: things should work if we always
> > > > > > acknowledged everything we've received.  Acknowledging only what the
> > > > > > peer has acked is a refinement to give the guest a view that's closer
> > > > > > to what it would be end-to-end with the peer (which might improve the
> > > > > > operation of flow control).    
> > > > > 
> > > > > Right.
> > > > >     
> > > > > > We can't use that refined mechanism when the socket is closing
> > > > > > (amongst other cases), because while we can get the peer acked bytes
> > > > > > from TCP_INFO, we can't get events when that changes, so we have no
> > > > > > mechanism to provide updates to the guest at the right time.  So we
> > > > > > fall back to the simpler method.
> > > > > > 
> > > > > > Is that correct?    
> > > > > 
> > > > > Also correct, yes. If you have a better idea to summarise this in the
> > > > > comment in tcp_buf_data_from_sock() let me know.    
> > > > 
> > > > Hm, I might.  Or actually a way to reorganise the code that I think
> > > > will be a bit clearer and probably allow a clearer comment too.  
> > > 
> > > I would keep reworks for a later moment. Right now, it's already taking
> > > me long enough to find a moment to investigate these issues, write these
> > > fixes, and test them.  
> > 
> > I mean... the change I'm proposing reduces lines of code (excepting
> > the big new comment), makes it easier to reason about and is localised
> > to the immediately surrounding code.  But whatever, I don't
> > particularly care about the order we do things.
> 
> Sure, I don't think that other patch is particularly complicated or
> might ever become problematic at all, but, from your earlier comment
> ("reorganise the code", as I mentioned the comment in
> tcp_buf_data_from_sock()), I thought you wanted to rework this
> particular code in tcp_buf_data_from_sock() at the same time.

Yeah, I guess "reorganise" gives an impression of a wider ranging
change than it actually was.

> I guess it's not the case judging from your most recent reply, though.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message
  2025-10-02  0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio
  2025-10-02  0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio
@ 2025-10-02  0:06 ` Stefano Brivio
  2025-10-02  2:44   ` David Gibson
  2025-10-02  0:06 ` [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence Stefano Brivio
  2025-10-02  0:06 ` [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away Stefano Brivio
  3 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-02  0:06 UTC (permalink / raw)
  To: passt-dev

According to RFC 9293 we should ignore data (note: not data segments)
in CLOSE-WAIT state (indicated by TAP_FIN_RCVD), see 3.10.7.4
"Other states":

  [...]

  Seventh, process the segment text:

  [...]

  CLOSE-WAIT STATE

  This should not occur since a FIN has been received from the remote
  side. Ignore the segment text.

and we almost do that, except that we would look at the data length
to decide whether it's a request for fast re-transmission, so fix
that, and while at it, log a message, so that cases such as the
following one are more apparent in debug logs:

28692   0.009758 88.198.0.164 → 93.235.151.95 54 TCP 55414 → 47080 [FIN, ACK] Seq=121441 Ack=141 Win=65536 Len=0

we should ignore this FIN flag, because we didn't accept data up
to this sequence (see next segment), but we don't do it, so, here:

28693   0.000036 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [ACK] Seq=141 Ack=90722 Win=32128 Len=0
28694   0.034597 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [FIN, ACK] Seq=141 Ack=90722 Win=121216 Len=0
28695   0.000019 88.198.0.164 → 93.235.151.95 54 TCP 55414 → 47080 [ACK] Seq=121442 Ack=142 Win=65536 Len=0
28696   0.162968 88.198.0.164 → 93.235.151.95 30773 TCP [TCP Retransmission] 55414 → 47080 [FIN, PSH, ACK] Seq=90722 Ack=142 Win=65536 Len=30719 [TCP segment of a reassembled PDU]

we are erroneously in CLOSE-WAIT (TAP_FIN_RCVD) state, and this
segment would look pretty strange there.

This specific case is fixed by the next patch, so it should never
happen again.

Link: https://archives.passt.top/passt-dev/20250910115726.432bbb8d@elisabeth/
Link: https://bugs.passt.top/show_bug.cgi?id=126
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index 48b1ef2..3f7dc82 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2130,9 +2130,15 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 	/* Established connections not accepting data from tap */
 	if (conn->events & TAP_FIN_RCVD) {
+		size_t dlen;
 		bool retr;
 
-		retr = th->ack && !tcp_packet_data_len(th, l4len) && !th->fin &&
+		if ((dlen = tcp_packet_data_len(th, l4len))) {
+			flow_dbg(conn, "data segment in CLOSE-WAIT (%zu B)",
+				 dlen);
+		}
+
+		retr = th->ack && !th->fin &&
 		       ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
 		       ntohs(th->window) == conn->wnd_from_tap;
 
-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message
  2025-10-02  0:06 ` [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message Stefano Brivio
@ 2025-10-02  2:44   ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-10-02  2:44 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3030 bytes --]

On Thu, Oct 02, 2025 at 02:06:44AM +0200, Stefano Brivio wrote:
> According to RFC 9293 we should ignore data (note: not data segments)
> in CLOSE-WAIT state (indicated by TAP_FIN_RCVD), see 3.10.7.4
> "Other states":
> 
>   [...]
> 
>   Seventh, process the segment text:
> 
>   [...]
> 
>   CLOSE-WAIT STATE
> 
>   This should not occur since a FIN has been received from the remote
>   side. Ignore the segment text.
> 
> and we almost do that, except that we would look at the data length
> to decide whether it's a request for fast re-transmission, so fix
> that, and while at it, log a message, so that cases such as the
> following one are more apparent in debug logs:
> 
> 28692   0.009758 88.198.0.164 → 93.235.151.95 54 TCP 55414 → 47080 [FIN, ACK] Seq=121441 Ack=141 Win=65536 Len=0
> 
> we should ignore this FIN flag, because we didn't accept data up
> to this sequence (see next segment), but we don't do it, so, here:
> 
> 28693   0.000036 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [ACK] Seq=141 Ack=90722 Win=32128 Len=0
> 28694   0.034597 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [FIN, ACK] Seq=141 Ack=90722 Win=121216 Len=0
> 28695   0.000019 88.198.0.164 → 93.235.151.95 54 TCP 55414 → 47080 [ACK] Seq=121442 Ack=142 Win=65536 Len=0
> 28696   0.162968 88.198.0.164 → 93.235.151.95 30773 TCP [TCP Retransmission] 55414 → 47080 [FIN, PSH, ACK] Seq=90722 Ack=142 Win=65536 Len=30719 [TCP segment of a reassembled PDU]
> 
> we are erroneously in CLOSE-WAIT (TAP_FIN_RCVD) state, and this
> segment would look pretty strange there.
> 
> This specific case is fixed by the next patch, so it should never
> happen again.
> 
> Link: https://archives.passt.top/passt-dev/20250910115726.432bbb8d@elisabeth/
> Link: https://bugs.passt.top/show_bug.cgi?id=126
> Suggested-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 48b1ef2..3f7dc82 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2130,9 +2130,15 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
>  
>  	/* Established connections not accepting data from tap */
>  	if (conn->events & TAP_FIN_RCVD) {
> +		size_t dlen;
>  		bool retr;
>  
> -		retr = th->ack && !tcp_packet_data_len(th, l4len) && !th->fin &&
> +		if ((dlen = tcp_packet_data_len(th, l4len))) {
> +			flow_dbg(conn, "data segment in CLOSE-WAIT (%zu B)",
> +				 dlen);
> +		}
> +
> +		retr = th->ack && !th->fin &&
>  		       ntohl(th->ack_seq) == conn->seq_ack_from_tap &&
>  		       ntohs(th->window) == conn->wnd_from_tap;
>  
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-02  0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio
  2025-10-02  0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio
  2025-10-02  0:06 ` [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message Stefano Brivio
@ 2025-10-02  0:06 ` Stefano Brivio
  2025-10-02  2:52   ` David Gibson
  2025-10-02  0:06 ` [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away Stefano Brivio
  3 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-02  0:06 UTC (permalink / raw)
  To: passt-dev

If a guest or container sends us a FIN segment but its sequence number
doesn't match the highest sequence of data we *accepted* (not
necessarily the highest sequence we received), that is,
conn->seq_from_tap, plus any data we're accepting in the current
batch, we should discard the flag (not necessarily the segment),
because there's still data we need to receive (again) before the end
of the stream.

If we consider those FIN flags as such, we'll end up in the
situation described below.

Here, 192.168.10.102 is a HTTP server in a Podman container, and
192.168.10.44 is a client fetching approximately 121 KB of data from
it:

   82   2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0

the server is done sending

   83   2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0

pasta (client) acknowledges a previous sequence, because of
a short sendmsg()

   84   2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0

pasta (client) sends FIN, ACK as the client has no more data to
send (a single GET request), while still acknowledging a previous
sequence, because the retransmission didn't happen yet

   85   2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0

the server acknowledges the FIN, ACK

   86   2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU]

and finally a retransmission comes, but as we wrongly switched to
the CLOSE-WAIT state,

   87   2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0

we consider frame #86 as an acknowledgement for the FIN segment we
sent, and close the connection, while we still had to re-receive
(and finally send) the missing data segment, instead.

Link: https://github.com/containers/podman/issues/27179
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index 3f7dc82..5a7a607 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			}
 		}
 
-		if (th->fin)
+		if (th->fin && seq == seq_from_tap)
 			fin = 1;
 
 		if (!len)
-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-02  0:06 ` [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence Stefano Brivio
@ 2025-10-02  2:52   ` David Gibson
  2025-10-02  3:02     ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-02  2:52 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3231 bytes --]

On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote:
> If a guest or container sends us a FIN segment but its sequence number
> doesn't match the highest sequence of data we *accepted* (not
> necessarily the highest sequence we received), that is,
> conn->seq_from_tap, plus any data we're accepting in the current
> batch, we should discard the flag (not necessarily the segment),
> because there's still data we need to receive (again) before the end
> of the stream.
> 
> If we consider those FIN flags as such, we'll end up in the
> situation described below.
> 
> Here, 192.168.10.102 is a HTTP server in a Podman container, and
> 192.168.10.44 is a client fetching approximately 121 KB of data from
> it:
> 
>    82   2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0
> 
> the server is done sending
> 
>    83   2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0
> 
> pasta (client) acknowledges a previous sequence, because of
> a short sendmsg()
> 
>    84   2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0
> 
> pasta (client) sends FIN, ACK as the client has no more data to
> send (a single GET request), while still acknowledging a previous
> sequence, because the retransmission didn't happen yet
> 
>    85   2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0
> 
> the server acknowledges the FIN, ACK
> 
>    86   2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU]
> 
> and finally a retransmission comes, but as we wrongly switched to
> the CLOSE-WAIT state,
> 
>    87   2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0
> 
> we consider frame #86 as an acknowledgement for the FIN segment we
> sent, and close the connection, while we still had to re-receive
> (and finally send) the missing data segment, instead.
> 
> Link: https://github.com/containers/podman/issues/27179
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 3f7dc82..5a7a607 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  			}
>  		}
>  
> -		if (th->fin)
> +		if (th->fin && seq == seq_from_tap)
>  			fin = 1;

Can a FIN segment also contain data?  My quick googling suggests yes.
If so, doesn't this logic need to go after we process the data
processing, so that seq_from_tap points to the end of the packet's
data, rather than the beginning?  (And the handling of zero-length
packets would also need revision to match).

>  
>  		if (!len)
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-02  2:52   ` David Gibson
@ 2025-10-02  3:02     ` David Gibson
  2025-10-02 11:51       ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-02  3:02 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3657 bytes --]

On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote:
> On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote:
> > If a guest or container sends us a FIN segment but its sequence number
> > doesn't match the highest sequence of data we *accepted* (not
> > necessarily the highest sequence we received), that is,
> > conn->seq_from_tap, plus any data we're accepting in the current
> > batch, we should discard the flag (not necessarily the segment),
> > because there's still data we need to receive (again) before the end
> > of the stream.
> > 
> > If we consider those FIN flags as such, we'll end up in the
> > situation described below.
> > 
> > Here, 192.168.10.102 is a HTTP server in a Podman container, and
> > 192.168.10.44 is a client fetching approximately 121 KB of data from
> > it:
> > 
> >    82   2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0
> > 
> > the server is done sending
> > 
> >    83   2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0
> > 
> > pasta (client) acknowledges a previous sequence, because of
> > a short sendmsg()
> > 
> >    84   2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0
> > 
> > pasta (client) sends FIN, ACK as the client has no more data to
> > send (a single GET request), while still acknowledging a previous
> > sequence, because the retransmission didn't happen yet
> > 
> >    85   2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0
> > 
> > the server acknowledges the FIN, ACK
> > 
> >    86   2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU]
> > 
> > and finally a retransmission comes, but as we wrongly switched to
> > the CLOSE-WAIT state,
> > 
> >    87   2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0
> > 
> > we consider frame #86 as an acknowledgement for the FIN segment we
> > sent, and close the connection, while we still had to re-receive
> > (and finally send) the missing data segment, instead.
> > 
> > Link: https://github.com/containers/podman/issues/27179
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 3f7dc82..5a7a607 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  			}
> >  		}
> >  
> > -		if (th->fin)
> > +		if (th->fin && seq == seq_from_tap)
> >  			fin = 1;
> 
> Can a FIN segment also contain data?  My quick googling suggests yes.
> If so, doesn't this logic need to go after we process the data
> processing, so that seq_from_tap points to the end of the packet's
> data, rather than the beginning?  (And the handling of zero-length
> packets would also need revision to match).

Following on from that, it seems to me like it would make sense for
FIN segments to also participate in the 'keep' mechanism.  It should
work eventually, but I expect it would be smoother in the case that we
get a final burst of packets in a stream out of order.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-02  3:02     ` David Gibson
@ 2025-10-02 11:51       ` Stefano Brivio
  2025-10-03  3:43         ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-02 11:51 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 2 Oct 2025 13:02:09 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote:
> > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote:  
> > > If a guest or container sends us a FIN segment but its sequence number
> > > doesn't match the highest sequence of data we *accepted* (not
> > > necessarily the highest sequence we received), that is,
> > > conn->seq_from_tap, plus any data we're accepting in the current
> > > batch, we should discard the flag (not necessarily the segment),
> > > because there's still data we need to receive (again) before the end
> > > of the stream.
> > > 
> > > If we consider those FIN flags as such, we'll end up in the
> > > situation described below.
> > > 
> > > Here, 192.168.10.102 is a HTTP server in a Podman container, and
> > > 192.168.10.44 is a client fetching approximately 121 KB of data from
> > > it:
> > > 
> > >    82   2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0
> > > 
> > > the server is done sending
> > > 
> > >    83   2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0
> > > 
> > > pasta (client) acknowledges a previous sequence, because of
> > > a short sendmsg()
> > > 
> > >    84   2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0
> > > 
> > > pasta (client) sends FIN, ACK as the client has no more data to
> > > send (a single GET request), while still acknowledging a previous
> > > sequence, because the retransmission didn't happen yet
> > > 
> > >    85   2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0
> > > 
> > > the server acknowledges the FIN, ACK
> > > 
> > >    86   2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU]
> > > 
> > > and finally a retransmission comes, but as we wrongly switched to
> > > the CLOSE-WAIT state,
> > > 
> > >    87   2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0
> > > 
> > > we consider frame #86 as an acknowledgement for the FIN segment we
> > > sent, and close the connection, while we still had to re-receive
> > > (and finally send) the missing data segment, instead.
> > > 
> > > Link: https://github.com/containers/podman/issues/27179
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  tcp.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index 3f7dc82..5a7a607 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > >  			}
> > >  		}
> > >  
> > > -		if (th->fin)
> > > +		if (th->fin && seq == seq_from_tap)
> > >  			fin = 1;  
> > 
> > Can a FIN segment also contain data?  My quick googling suggests yes.

Yes, absolutely, my slow wiresharking over the years also confirms, and
it's so often the case that (I think) this issue doesn't happen so
frequently as it only occurs if we have a FIN segment without data.

If we have a data segment, with FIN set, that we can't fully transmit,
we already set 'partial_send' and won't set TAP_FIN_RCVD as a result.

Another case where we want to ignore a FIN segment with data is if we
have a gap before it, but in that case we'll eventually set 'keep' and
return early.

> > If so, doesn't this logic need to go after we process the data
> > processing, so that seq_from_tap points to the end of the packet's
> > data, rather than the beginning?  (And the handling of zero-length
> > packets would also need revision to match).  

This made sense to me for a moment but now I'm struggling to understand
or remember why. What I want to check here is that a FIN segment
without data (I should have specified in the commit message) is
acceptable because its sequence is as expected.

But going back to FIN segments with data: why should we sum the length
to seq_from_tap before comparing the sequence? I don't understand what
additional check you want to introduce, or what case you want to cover.

> Following on from that, it seems to me like it would make sense for
> FIN segments to also participate in the 'keep' mechanism.  It should
> work eventually, but I expect it would be smoother in the case that we
> get a final burst of packets in a stream out of order.

FIN segments with data already go through that dance.

Without data, I guess you're right, we might have in the same batch
(not that I've ever seen it happening in practice) a FIN segment
without data that we process first (and now discard because of the
sequence number), and some data before that we process later, so we
shouldn't throw away the FIN segment because of that. We should,
conceptually, reorder it as well.

It probably makes things more complicated for a reason that's not so
critical (ignoring a FIN is fine, we'll get another one), and I wanted
to have the simplest possible fix here.

Let me see if I can make this entirely correct without a substantially
bigger change, I haven't really tried.

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-02 11:51       ` Stefano Brivio
@ 2025-10-03  3:43         ` David Gibson
  2025-10-06 22:32           ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-03  3:43 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4869 bytes --]

On Thu, Oct 02, 2025 at 01:51:04PM +0200, Stefano Brivio wrote:
> On Thu, 2 Oct 2025 13:02:09 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote:
> > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote:
[snip]
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 3f7dc82..5a7a607 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > >  			}
> > > >  		}
> > > >  
> > > > -		if (th->fin)
> > > > +		if (th->fin && seq == seq_from_tap)
> > > >  			fin = 1;  
> > > 
> > > Can a FIN segment also contain data?  My quick googling suggests yes.
> 
> Yes, absolutely, my slow wiresharking over the years also confirms, and
> it's so often the case that (I think) this issue doesn't happen so
> frequently as it only occurs if we have a FIN segment without data.

Makes sense.

> If we have a data segment, with FIN set, that we can't fully transmit,
> we already set 'partial_send' and won't set TAP_FIN_RCVD as a result.
> 
> Another case where we want to ignore a FIN segment with data is if we
> have a gap before it, but in that case we'll eventually set 'keep' and
> return early.

Ah, right.  I'd noticed we set fin = 1 in that case, but forgotten
about the exit before setting TAP_FIN_RCVD if keep is set.

> > > If so, doesn't this logic need to go after we process the data
> > > processing, so that seq_from_tap points to the end of the packet's
> > > data, rather than the beginning?  (And the handling of zero-length
> > > packets would also need revision to match).  
> 
> This made sense to me for a moment but now I'm struggling to understand
> or remember why. What I want to check here is that a FIN segment
> without data (I should have specified in the commit message) is
> acceptable because its sequence is as expected.

Right.  This is correct for zero-data FIN segments, but I think as a
side-effect you've made it ignore certain FIN segments _with_ data.
It will work in the common case where the data exactly follows on from
what we already have.  But in the case where the segment has some data
we already have and some new data, the fin = 1 won't trip because seq
!= seq_from_tap.  There isn't another place that will catch it
instead, AFAICT.

I guess it will be fine in the end, because with all the data acked,
the guest should retransmit the FIN with no data.

> But going back to FIN segments with data: why should we sum the length
> to seq_from_tap before comparing the sequence? I don't understand what
> additional check you want to introduce, or what case you want to cover.

I was thinking about the case above, but I didn't explain it very
well.

> > Following on from that, it seems to me like it would make sense for
> > FIN segments to also participate in the 'keep' mechanism.  It should
> > work eventually, but I expect it would be smoother in the case that we
> > get a final burst of packets in a stream out of order.
> 
> FIN segments with data already go through that dance.
> 
> Without data, I guess you're right, we might have in the same batch
> (not that I've ever seen it happening in practice) a FIN segment
> without data that we process first (and now discard because of the
> sequence number), and some data before that we process later, so we
> shouldn't throw away the FIN segment because of that. We should,
> conceptually, reorder it as well.
> 
> It probably makes things more complicated for a reason that's not so
> critical (ignoring a FIN is fine, we'll get another one), and I wanted
> to have the simplest possible fix here.
> 
> Let me see if I can make this entirely correct without a substantially
> bigger change, I haven't really tried.

How about this:

diff --git a/tcp.c b/tcp.c
index 7da41797..42e576b4 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1774,10 +1774,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			}
 		}
 
-		if (th->fin)
-			fin = 1;
-
-		if (!len)
+		if (!len && !th->fin)
 			continue;
 
 		seq_offset = seq_from_tap - seq;
@@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			break;
 		seq_from_tap += size;
 		iov_i += count;
+		if (th->fin)
+			fin = 1;
 
 		if (keep == i)
 			keep = -1;


We'd need to double check that the "accept data segment" path is safe
with len == 0, of course.  But I think that will treat dataless and
with-data FINs the same way, and let them use the keep mechanism.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-03  3:43         ` David Gibson
@ 2025-10-06 22:32           ` Stefano Brivio
  2025-10-06 23:34             ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-06 22:32 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 3 Oct 2025 13:43:32 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Oct 02, 2025 at 01:51:04PM +0200, Stefano Brivio wrote:
> > On Thu, 2 Oct 2025 13:02:09 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:  
> > > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote:  
> > > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote:  
> [snip]
> > > > > diff --git a/tcp.c b/tcp.c
> > > > > index 3f7dc82..5a7a607 100644
> > > > > --- a/tcp.c
> > > > > +++ b/tcp.c
> > > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > > >  			}
> > > > >  		}
> > > > >  
> > > > > -		if (th->fin)
> > > > > +		if (th->fin && seq == seq_from_tap)
> > > > >  			fin = 1;    
> > > > 
> > > > Can a FIN segment also contain data?  My quick googling suggests yes.  
> > 
> > Yes, absolutely, my slow wiresharking over the years also confirms, and
> > it's so often the case that (I think) this issue doesn't happen so
> > frequently as it only occurs if we have a FIN segment without data.  
> 
> Makes sense.
> 
> > If we have a data segment, with FIN set, that we can't fully transmit,
> > we already set 'partial_send' and won't set TAP_FIN_RCVD as a result.
> > 
> > Another case where we want to ignore a FIN segment with data is if we
> > have a gap before it, but in that case we'll eventually set 'keep' and
> > return early.  
> 
> Ah, right.  I'd noticed we set fin = 1 in that case, but forgotten
> about the exit before setting TAP_FIN_RCVD if keep is set.
> 
> > > > If so, doesn't this logic need to go after we process the data
> > > > processing, so that seq_from_tap points to the end of the packet's
> > > > data, rather than the beginning?  (And the handling of zero-length
> > > > packets would also need revision to match).    
> > 
> > This made sense to me for a moment but now I'm struggling to understand
> > or remember why. What I want to check here is that a FIN segment
> > without data (I should have specified in the commit message) is
> > acceptable because its sequence is as expected.  
> 
> Right.  This is correct for zero-data FIN segments, but I think as a
> side-effect you've made it ignore certain FIN segments _with_ data.
> It will work in the common case where the data exactly follows on from
> what we already have.  But in the case where the segment has some data
> we already have and some new data, the fin = 1 won't trip because seq
> != seq_from_tap.  There isn't another place that will catch it
> instead, AFAICT.
> 
> I guess it will be fine in the end, because with all the data acked,
> the guest should retransmit the FIN with no data.
> 
> > But going back to FIN segments with data: why should we sum the length
> > to seq_from_tap before comparing the sequence? I don't understand what
> > additional check you want to introduce, or what case you want to cover.  
> 
> I was thinking about the case above, but I didn't explain it very
> well.
> 
> > > Following on from that, it seems to me like it would make sense for
> > > FIN segments to also participate in the 'keep' mechanism.  It should
> > > work eventually, but I expect it would be smoother in the case that we
> > > get a final burst of packets in a stream out of order.  
> > 
> > FIN segments with data already go through that dance.
> > 
> > Without data, I guess you're right, we might have in the same batch
> > (not that I've ever seen it happening in practice) a FIN segment
> > without data that we process first (and now discard because of the
> > sequence number), and some data before that we process later, so we
> > shouldn't throw away the FIN segment because of that. We should,
> > conceptually, reorder it as well.
> > 
> > It probably makes things more complicated for a reason that's not so
> > critical (ignoring a FIN is fine, we'll get another one), and I wanted
> > to have the simplest possible fix here.
> > 
> > Let me see if I can make this entirely correct without a substantially
> > bigger change, I haven't really tried.  
> 
> How about this:
> 
> diff --git a/tcp.c b/tcp.c
> index 7da41797..42e576b4 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1774,10 +1774,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  			}
>  		}
>  
> -		if (th->fin)
> -			fin = 1;
> -
> -		if (!len)
> +		if (!len && !th->fin)
>  			continue;
>  
>  		seq_offset = seq_from_tap - seq;
> @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  			break;
>  		seq_from_tap += size;
>  		iov_i += count;
> +		if (th->fin)
> +			fin = 1;
>  
>  		if (keep == i)
>  			keep = -1;
> 
> 
> We'd need to double check that the "accept data segment" path is safe
> with len == 0, of course.

For sure it's not before d2c33f45f7be ("tcp: Convert
tcp_data_from_tap() to use iov_tail"), because we might add
zero-length segments to the tcp_iov array, and that would make
backporting an otherwise simple and critical fix to slightly older
versions rather complicated.

After that commit, I'm not sure about the behaviour of
iov_tail_clone(). I think it will return 0, but it should be tested
(that assumption, itself, but also that the fix still works).

> But I think that will treat dataless and
> with-data FINs the same way, and let them use the keep mechanism.

Given that the only advantage of doing this would be to handle a rare
(I guess) corner case, that is, an out-of-sequence FIN segment with
data, which is not critical anyway because the FIN will be
retransmitted, I would rather keep this (critical) fix as it is.

I would suggest filing a separate ticket or anyway sending a separate
patch if you want to fix that other case.

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-06 22:32           ` Stefano Brivio
@ 2025-10-06 23:34             ` David Gibson
  2025-10-07 22:42               ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-06 23:34 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6763 bytes --]

On Tue, Oct 07, 2025 at 12:32:54AM +0200, Stefano Brivio wrote:
> On Fri, 3 Oct 2025 13:43:32 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Oct 02, 2025 at 01:51:04PM +0200, Stefano Brivio wrote:
> > > On Thu, 2 Oct 2025 13:02:09 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:  
> > > > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote:  
> > > > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote:  
> > [snip]
> > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > index 3f7dc82..5a7a607 100644
> > > > > > --- a/tcp.c
> > > > > > +++ b/tcp.c
> > > > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > > > >  			}
> > > > > >  		}
> > > > > >  
> > > > > > -		if (th->fin)
> > > > > > +		if (th->fin && seq == seq_from_tap)
> > > > > >  			fin = 1;    
> > > > > 
> > > > > Can a FIN segment also contain data?  My quick googling suggests yes.  
> > > 
> > > Yes, absolutely, my slow wiresharking over the years also confirms, and
> > > it's so often the case that (I think) this issue doesn't happen so
> > > frequently as it only occurs if we have a FIN segment without data.  
> > 
> > Makes sense.
> > 
> > > If we have a data segment, with FIN set, that we can't fully transmit,
> > > we already set 'partial_send' and won't set TAP_FIN_RCVD as a result.
> > > 
> > > Another case where we want to ignore a FIN segment with data is if we
> > > have a gap before it, but in that case we'll eventually set 'keep' and
> > > return early.  
> > 
> > Ah, right.  I'd noticed we set fin = 1 in that case, but forgotten
> > about the exit before setting TAP_FIN_RCVD if keep is set.
> > 
> > > > > If so, doesn't this logic need to go after we process the data
> > > > > processing, so that seq_from_tap points to the end of the packet's
> > > > > data, rather than the beginning?  (And the handling of zero-length
> > > > > packets would also need revision to match).    
> > > 
> > > This made sense to me for a moment but now I'm struggling to understand
> > > or remember why. What I want to check here is that a FIN segment
> > > without data (I should have specified in the commit message) is
> > > acceptable because its sequence is as expected.  
> > 
> > Right.  This is correct for zero-data FIN segments, but I think as a
> > side-effect you've made it ignore certain FIN segments _with_ data.
> > It will work in the common case where the data exactly follows on from
> > what we already have.  But in the case where the segment has some data
> > we already have and some new data, the fin = 1 won't trip because seq
> > != seq_from_tap.  There isn't another place that will catch it
> > instead, AFAICT.
> > 
> > I guess it will be fine in the end, because with all the data acked,
> > the guest should retransmit the FIN with no data.
> > 
> > > But going back to FIN segments with data: why should we sum the length
> > > to seq_from_tap before comparing the sequence? I don't understand what
> > > additional check you want to introduce, or what case you want to cover.  
> > 
> > I was thinking about the case above, but I didn't explain it very
> > well.
> > 
> > > > Following on from that, it seems to me like it would make sense for
> > > > FIN segments to also participate in the 'keep' mechanism.  It should
> > > > work eventually, but I expect it would be smoother in the case that we
> > > > get a final burst of packets in a stream out of order.  
> > > 
> > > FIN segments with data already go through that dance.
> > > 
> > > Without data, I guess you're right, we might have in the same batch
> > > (not that I've ever seen it happening in practice) a FIN segment
> > > without data that we process first (and now discard because of the
> > > sequence number), and some data before that we process later, so we
> > > shouldn't throw away the FIN segment because of that. We should,
> > > conceptually, reorder it as well.
> > > 
> > > It probably makes things more complicated for a reason that's not so
> > > critical (ignoring a FIN is fine, we'll get another one), and I wanted
> > > to have the simplest possible fix here.
> > > 
> > > Let me see if I can make this entirely correct without a substantially
> > > bigger change, I haven't really tried.  
> > 
> > How about this:
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 7da41797..42e576b4 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1774,10 +1774,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  			}
> >  		}
> >  
> > -		if (th->fin)
> > -			fin = 1;
> > -
> > -		if (!len)
> > +		if (!len && !th->fin)
> >  			continue;
> >  
> >  		seq_offset = seq_from_tap - seq;
> > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> >  			break;
> >  		seq_from_tap += size;
> >  		iov_i += count;
> > +		if (th->fin)
> > +			fin = 1;
> >  
> >  		if (keep == i)
> >  			keep = -1;
> > 
> > 
> > We'd need to double check that the "accept data segment" path is safe
> > with len == 0, of course.
> 
> For sure it's not before d2c33f45f7be ("tcp: Convert
> tcp_data_from_tap() to use iov_tail"), because we might add
> zero-length segments to the tcp_iov array, and that would make
> backporting an otherwise simple and critical fix to slightly older
> versions rather complicated.

Kinda.  It's not that complicated to deal with that case, by wrapping
the actual data processing in an `if (len) { ... }`

> After that commit, I'm not sure about the behaviour of
> iov_tail_clone(). I think it will return 0, but it should be tested
> (that assumption, itself, but also that the fix still works).

Right, I think it's safe in that case - that's what I was looking at.

> > But I think that will treat dataless and
> > with-data FINs the same way, and let them use the keep mechanism.
> 
> Given that the only advantage of doing this would be to handle a rare
> (I guess) corner case, that is, an out-of-sequence FIN segment with
> data, which is not critical anyway because the FIN will be
> retransmitted, I would rather keep this (critical) fix as it is.
> 
> I would suggest filing a separate ticket or anyway sending a separate
> patch if you want to fix that other case.

Fair enough.  I'll wait for you to get the basic fix merge, then I
might batch this cleanup/fixup with the reorg to clarify bytes_acked
handling.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-06 23:34             ` David Gibson
@ 2025-10-07 22:42               ` Stefano Brivio
  2025-10-08  0:41                 ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-07 22:42 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue, 7 Oct 2025 10:34:01 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Oct 07, 2025 at 12:32:54AM +0200, Stefano Brivio wrote:
> > On Fri, 3 Oct 2025 13:43:32 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Thu, Oct 02, 2025 at 01:51:04PM +0200, Stefano Brivio wrote:  
> > > > On Thu, 2 Oct 2025 13:02:09 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:    
> > > > > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote:    
> > > > > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote:    
> > > [snip]  
> > > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > > index 3f7dc82..5a7a607 100644
> > > > > > > --- a/tcp.c
> > > > > > > +++ b/tcp.c
> > > > > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > > > > >  			}
> > > > > > >  		}
> > > > > > >  
> > > > > > > -		if (th->fin)
> > > > > > > +		if (th->fin && seq == seq_from_tap)
> > > > > > >  			fin = 1;      
> > > > > > 
> > > > > > Can a FIN segment also contain data?  My quick googling suggests yes.    
> > > > 
> > > > Yes, absolutely, my slow wiresharking over the years also confirms, and
> > > > it's so often the case that (I think) this issue doesn't happen so
> > > > frequently as it only occurs if we have a FIN segment without data.    
> > > 
> > > Makes sense.
> > >   
> > > > If we have a data segment, with FIN set, that we can't fully transmit,
> > > > we already set 'partial_send' and won't set TAP_FIN_RCVD as a result.
> > > > 
> > > > Another case where we want to ignore a FIN segment with data is if we
> > > > have a gap before it, but in that case we'll eventually set 'keep' and
> > > > return early.    
> > > 
> > > Ah, right.  I'd noticed we set fin = 1 in that case, but forgotten
> > > about the exit before setting TAP_FIN_RCVD if keep is set.
> > >   
> > > > > > If so, doesn't this logic need to go after we process the data
> > > > > > processing, so that seq_from_tap points to the end of the packet's
> > > > > > data, rather than the beginning?  (And the handling of zero-length
> > > > > > packets would also need revision to match).      
> > > > 
> > > > This made sense to me for a moment but now I'm struggling to understand
> > > > or remember why. What I want to check here is that a FIN segment
> > > > without data (I should have specified in the commit message) is
> > > > acceptable because its sequence is as expected.    
> > > 
> > > Right.  This is correct for zero-data FIN segments, but I think as a
> > > side-effect you've made it ignore certain FIN segments _with_ data.
> > > It will work in the common case where the data exactly follows on from
> > > what we already have.  But in the case where the segment has some data
> > > we already have and some new data, the fin = 1 won't trip because seq
> > > != seq_from_tap.  There isn't another place that will catch it
> > > instead, AFAICT.
> > > 
> > > I guess it will be fine in the end, because with all the data acked,
> > > the guest should retransmit the FIN with no data.
> > >   
> > > > But going back to FIN segments with data: why should we sum the length
> > > > to seq_from_tap before comparing the sequence? I don't understand what
> > > > additional check you want to introduce, or what case you want to cover.    
> > > 
> > > I was thinking about the case above, but I didn't explain it very
> > > well.
> > >   
> > > > > Following on from that, it seems to me like it would make sense for
> > > > > FIN segments to also participate in the 'keep' mechanism.  It should
> > > > > work eventually, but I expect it would be smoother in the case that we
> > > > > get a final burst of packets in a stream out of order.    
> > > > 
> > > > FIN segments with data already go through that dance.
> > > > 
> > > > Without data, I guess you're right, we might have in the same batch
> > > > (not that I've ever seen it happening in practice) a FIN segment
> > > > without data that we process first (and now discard because of the
> > > > sequence number), and some data before that we process later, so we
> > > > shouldn't throw away the FIN segment because of that. We should,
> > > > conceptually, reorder it as well.
> > > > 
> > > > It probably makes things more complicated for a reason that's not so
> > > > critical (ignoring a FIN is fine, we'll get another one), and I wanted
> > > > to have the simplest possible fix here.
> > > > 
> > > > Let me see if I can make this entirely correct without a substantially
> > > > bigger change, I haven't really tried.    
> > > 
> > > How about this:
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index 7da41797..42e576b4 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -1774,10 +1774,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > >  			}
> > >  		}
> > >  
> > > -		if (th->fin)
> > > -			fin = 1;
> > > -
> > > -		if (!len)
> > > +		if (!len && !th->fin)
> > >  			continue;
> > >  
> > >  		seq_offset = seq_from_tap - seq;
> > > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > >  			break;
> > >  		seq_from_tap += size;
> > >  		iov_i += count;
> > > +		if (th->fin)
> > > +			fin = 1;
> > >  
> > >  		if (keep == i)
> > >  			keep = -1;
> > > 
> > > 
> > > We'd need to double check that the "accept data segment" path is safe
> > > with len == 0, of course.  
> > 
> > For sure it's not before d2c33f45f7be ("tcp: Convert
> > tcp_data_from_tap() to use iov_tail"), because we might add
> > zero-length segments to the tcp_iov array, and that would make
> > backporting an otherwise simple and critical fix to slightly older
> > versions rather complicated.  
> 
> Kinda.  It's not that complicated to deal with that case, by wrapping
> the actual data processing in an `if (len) { ... }`

That's needed for sure, but do we risk looping forever on a particular
batch of FIN segments without data with a given series of sequence
numbers?

Right now the loop terminates because the sequence moves forward. I'm
not sure what happens without data while moving 'keep' around. Maybe it
takes a few minutes to figure out (I haven't tried) but I wouldn't call
that trivial.

> > After that commit, I'm not sure about the behaviour of
> > iov_tail_clone(). I think it will return 0, but it should be tested
> > (that assumption, itself, but also that the fix still works).  
> 
> Right, I think it's safe in that case - that's what I was looking at.

Note that I didn't test this.

> > > But I think that will treat dataless and
> > > with-data FINs the same way, and let them use the keep mechanism.  
> > 
> > Given that the only advantage of doing this would be to handle a rare
> > (I guess) corner case, that is, an out-of-sequence FIN segment with
> > data, which is not critical anyway because the FIN will be
> > retransmitted, I would rather keep this (critical) fix as it is.
> > 
> > I would suggest filing a separate ticket or anyway sending a separate
> > patch if you want to fix that other case.  
> 
> Fair enough.  I'll wait for you to get the basic fix merge, then I
> might batch this cleanup/fixup with the reorg to clarify bytes_acked
> handling.

Okay, thanks, merged now.

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-07 22:42               ` Stefano Brivio
@ 2025-10-08  0:41                 ` David Gibson
  2025-10-08 10:01                   ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-08  0:41 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3162 bytes --]

On Wed, Oct 08, 2025 at 12:42:49AM +0200, Stefano Brivio wrote:
> On Tue, 7 Oct 2025 10:34:01 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> > On Tue, Oct 07, 2025 at 12:32:54AM +0200, Stefano Brivio wrote:
> > > On Fri, 3 Oct 2025 13:43:32 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
[snip]
> > > > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > >  			break;
> > > >  		seq_from_tap += size;
> > > >  		iov_i += count;
> > > > +		if (th->fin)
> > > > +			fin = 1;
> > > >  
> > > >  		if (keep == i)
> > > >  			keep = -1;
> > > > 
> > > > 
> > > > We'd need to double check that the "accept data segment" path is safe
> > > > with len == 0, of course.  
> > > 
> > > For sure it's not before d2c33f45f7be ("tcp: Convert
> > > tcp_data_from_tap() to use iov_tail"), because we might add
> > > zero-length segments to the tcp_iov array, and that would make
> > > backporting an otherwise simple and critical fix to slightly older
> > > versions rather complicated.  
> > 
> > Kinda.  It's not that complicated to deal with that case, by wrapping
> > the actual data processing in an `if (len) { ... }`
> 
> That's needed for sure, but do we risk looping forever on a particular
> batch of FIN segments without data with a given series of sequence
> numbers?
>
> Right now the loop terminates because the sequence moves forward. I'm
> not sure what happens without data while moving 'keep' around. Maybe it
> takes a few minutes to figure out (I haven't tried) but I wouldn't call
> that trivial.

That should be fine, because we need to advance the sequence by one
for the FIN anyway, so we will move forwards.  I might have forgotten
that in my quick example, which is a bug, but an easy one to fix.

> > > After that commit, I'm not sure about the behaviour of
> > > iov_tail_clone(). I think it will return 0, but it should be tested
> > > (that assumption, itself, but also that the fix still works).  
> > 
> > Right, I think it's safe in that case - that's what I was looking at.
> 
> Note that I didn't test this.

Understood.

> > > > But I think that will treat dataless and
> > > > with-data FINs the same way, and let them use the keep mechanism.  
> > > 
> > > Given that the only advantage of doing this would be to handle a rare
> > > (I guess) corner case, that is, an out-of-sequence FIN segment with
> > > data, which is not critical anyway because the FIN will be
> > > retransmitted, I would rather keep this (critical) fix as it is.
> > > 
> > > I would suggest filing a separate ticket or anyway sending a separate
> > > patch if you want to fix that other case.  
> > 
> > Fair enough.  I'll wait for you to get the basic fix merge, then I
> > might batch this cleanup/fixup with the reorg to clarify bytes_acked
> > handling.
> 
> Okay, thanks, merged now.
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-08  0:41                 ` David Gibson
@ 2025-10-08 10:01                   ` Stefano Brivio
  2025-10-09  2:43                     ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-08 10:01 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 8 Oct 2025 11:41:23 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Oct 08, 2025 at 12:42:49AM +0200, Stefano Brivio wrote:
> > On Tue, 7 Oct 2025 10:34:01 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:  
> > > On Tue, Oct 07, 2025 at 12:32:54AM +0200, Stefano Brivio wrote:  
> > > > On Fri, 3 Oct 2025 13:43:32 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:  
> [snip]
> > > > > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > > >  			break;
> > > > >  		seq_from_tap += size;
> > > > >  		iov_i += count;
> > > > > +		if (th->fin)
> > > > > +			fin = 1;
> > > > >  
> > > > >  		if (keep == i)
> > > > >  			keep = -1;
> > > > > 
> > > > > 
> > > > > We'd need to double check that the "accept data segment" path is safe
> > > > > with len == 0, of course.    
> > > > 
> > > > For sure it's not before d2c33f45f7be ("tcp: Convert
> > > > tcp_data_from_tap() to use iov_tail"), because we might add
> > > > zero-length segments to the tcp_iov array, and that would make
> > > > backporting an otherwise simple and critical fix to slightly older
> > > > versions rather complicated.    
> > > 
> > > Kinda.  It's not that complicated to deal with that case, by wrapping
> > > the actual data processing in an `if (len) { ... }`  
> > 
> > That's needed for sure, but do we risk looping forever on a particular
> > batch of FIN segments without data with a given series of sequence
> > numbers?
> >
> > Right now the loop terminates because the sequence moves forward. I'm
> > not sure what happens without data while moving 'keep' around. Maybe it
> > takes a few minutes to figure out (I haven't tried) but I wouldn't call
> > that trivial.  
> 
> That should be fine, because we need to advance the sequence by one
> for the FIN anyway, so we will move forwards.  I might have forgotten
> that in my quick example, which is a bug, but an easy one to fix.

But we don't, in that loop, and there's a specific reason for it. FIN
segments are a special case in that, if you receive more than one, you
don't advance the sequence by more than one, even if the segments
themselves are in the expected sequence.

If you want to move the sequence increase into that loop (which might
make sense with some extra care), perhaps it's worth doing that
together with the whole https://bugs.passt.top/show_bug.cgi?id=125 at
this point.

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-08 10:01                   ` Stefano Brivio
@ 2025-10-09  2:43                     ` David Gibson
  2025-10-09 19:28                       ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-10-09  2:43 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3016 bytes --]

On Wed, Oct 08, 2025 at 12:01:27PM +0200, Stefano Brivio wrote:
> On Wed, 8 Oct 2025 11:41:23 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Oct 08, 2025 at 12:42:49AM +0200, Stefano Brivio wrote:
> > > On Tue, 7 Oct 2025 10:34:01 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:  
> > > > On Tue, Oct 07, 2025 at 12:32:54AM +0200, Stefano Brivio wrote:  
> > > > > On Fri, 3 Oct 2025 13:43:32 +1000
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:  
> > [snip]
> > > > > > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > > > >  			break;
> > > > > >  		seq_from_tap += size;
> > > > > >  		iov_i += count;
> > > > > > +		if (th->fin)
> > > > > > +			fin = 1;
> > > > > >  
> > > > > >  		if (keep == i)
> > > > > >  			keep = -1;
> > > > > > 
> > > > > > 
> > > > > > We'd need to double check that the "accept data segment" path is safe
> > > > > > with len == 0, of course.    
> > > > > 
> > > > > For sure it's not before d2c33f45f7be ("tcp: Convert
> > > > > tcp_data_from_tap() to use iov_tail"), because we might add
> > > > > zero-length segments to the tcp_iov array, and that would make
> > > > > backporting an otherwise simple and critical fix to slightly older
> > > > > versions rather complicated.    
> > > > 
> > > > Kinda.  It's not that complicated to deal with that case, by wrapping
> > > > the actual data processing in an `if (len) { ... }`  
> > > 
> > > That's needed for sure, but do we risk looping forever on a particular
> > > batch of FIN segments without data with a given series of sequence
> > > numbers?
> > >
> > > Right now the loop terminates because the sequence moves forward. I'm
> > > not sure what happens without data while moving 'keep' around. Maybe it
> > > takes a few minutes to figure out (I haven't tried) but I wouldn't call
> > > that trivial.  
> > 
> > That should be fine, because we need to advance the sequence by one
> > for the FIN anyway, so we will move forwards.  I might have forgotten
> > that in my quick example, which is a bug, but an easy one to fix.
> 
> But we don't, in that loop, and there's a specific reason for it. FIN
> segments are a special case in that, if you receive more than one, you
> don't advance the sequence by more than one, even if the segments
> themselves are in the expected sequence.

Ok.  *thinks*.  So, if we both set fin = 1 and advance the sequence
inside an if (th->fin && !fin), that should do the trick, yes?

> If you want to move the sequence increase into that loop (which might
> make sense with some extra care), perhaps it's worth doing that
> together with the whole https://bugs.passt.top/show_bug.cgi?id=125 at
> this point.

Yeah, maybe.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence
  2025-10-09  2:43                     ` David Gibson
@ 2025-10-09 19:28                       ` Stefano Brivio
  0 siblings, 0 replies; 25+ messages in thread
From: Stefano Brivio @ 2025-10-09 19:28 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 9 Oct 2025 13:43:00 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Oct 08, 2025 at 12:01:27PM +0200, Stefano Brivio wrote:
> > On Wed, 8 Oct 2025 11:41:23 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Oct 08, 2025 at 12:42:49AM +0200, Stefano Brivio wrote:  
> > > > On Tue, 7 Oct 2025 10:34:01 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:    
> > > > > On Tue, Oct 07, 2025 at 12:32:54AM +0200, Stefano Brivio wrote:    
> > > > > > On Fri, 3 Oct 2025 13:43:32 +1000
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:    
> > > [snip]  
> > > > > > > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn,
> > > > > > >  			break;
> > > > > > >  		seq_from_tap += size;
> > > > > > >  		iov_i += count;
> > > > > > > +		if (th->fin)
> > > > > > > +			fin = 1;
> > > > > > >  
> > > > > > >  		if (keep == i)
> > > > > > >  			keep = -1;
> > > > > > > 
> > > > > > > 
> > > > > > > We'd need to double check that the "accept data segment" path is safe
> > > > > > > with len == 0, of course.      
> > > > > > 
> > > > > > For sure it's not before d2c33f45f7be ("tcp: Convert
> > > > > > tcp_data_from_tap() to use iov_tail"), because we might add
> > > > > > zero-length segments to the tcp_iov array, and that would make
> > > > > > backporting an otherwise simple and critical fix to slightly older
> > > > > > versions rather complicated.      
> > > > > 
> > > > > Kinda.  It's not that complicated to deal with that case, by wrapping
> > > > > the actual data processing in an `if (len) { ... }`    
> > > > 
> > > > That's needed for sure, but do we risk looping forever on a particular
> > > > batch of FIN segments without data with a given series of sequence
> > > > numbers?
> > > >
> > > > Right now the loop terminates because the sequence moves forward. I'm
> > > > not sure what happens without data while moving 'keep' around. Maybe it
> > > > takes a few minutes to figure out (I haven't tried) but I wouldn't call
> > > > that trivial.    
> > > 
> > > That should be fine, because we need to advance the sequence by one
> > > for the FIN anyway, so we will move forwards.  I might have forgotten
> > > that in my quick example, which is a bug, but an easy one to fix.  
> > 
> > But we don't, in that loop, and there's a specific reason for it. FIN
> > segments are a special case in that, if you receive more than one, you
> > don't advance the sequence by more than one, even if the segments
> > themselves are in the expected sequence.  
> 
> Ok.  *thinks*.  So, if we both set fin = 1 and advance the sequence
> inside an if (th->fin && !fin), that should do the trick, yes?

Yes, but mind that we must then rewind the sequence on 'partial_send'
as well as on any non-retried sendmsg() failure because we want to
pretend, in those cases, that we never received a FIN (we won't set
TAP_FIN_RCVD either).

I think it will be less awkward and probably more robust altogether to
make sure that the 'keep' mechanism can't be fooled by FIN segments
with looping sequences.

Actually, now that we have packet pools (I didn't have them when I
first implemented this logic in dc169643a457 "tcp: Full batched
processing for tap messages"), do we want to try and reorder segments
first, separately?

It's not great for cache locality but perhaps worth at least profiling,
and it would probably make things simpler.

In that case, though, maybe it's appropriate to have a complete, less
intrusive fix for out-of-sequence FIN handling. If we advance the
sequence there, and rewind it later, I think it's not that bad as an
intermediate step.

> > If you want to move the sequence increase into that loop (which might
> > make sense with some extra care), perhaps it's worth doing that
> > together with the whole https://bugs.passt.top/show_bug.cgi?id=125 at
> > this point.  
> 
> Yeah, maybe.

By the way a separate reordering step should make this easier as well,
but still I'm not sure it's justified, I think it depends on what
perf(1) says about it.

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away
  2025-10-02  0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio
                   ` (2 preceding siblings ...)
  2025-10-02  0:06 ` [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence Stefano Brivio
@ 2025-10-02  0:06 ` Stefano Brivio
  2025-10-02  3:00   ` David Gibson
  3 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-10-02  0:06 UTC (permalink / raw)
  To: passt-dev

...possibly with an updated window value. As we're discarding the
remaining data, we'll need to receive it again. If we don't request
a retransmission immediately, we'll see an apparent gap in the
sequence, and request a retransmission on the next data batch or
segment, but we're just wasting time like that. In packets:

28686   0.000007 88.198.0.164 → 93.235.151.95 16118 TCP 55414 → 47080 [ACK] Seq=80501 Ack=141 Win=65536 Len=16064 [TCP segment of a reassembled PDU]
28687   0.000012 88.198.0.164 → 93.235.151.95 16118 TCP [TCP Window Full] 55414 → 47080 [ACK] Seq=96565 Ack=141 Win=65536 Len=16064 [TCP segment of a reassembled PDU]

on this second data segment, we have a short sendmsg(), and

28688   0.000026 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0

consequently acknowledge it, without requesting a retransmission,

28689   0.000006 88.198.0.164 → 93.235.151.95 8866 HTTP HTTP/1.1 200 ok  (text/css)

so the server proceeds sending a bit more, but

28690   0.000016 93.235.151.95 → 88.198.0.164 54 TCP [TCP Dup ACK 28688#1] 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0
28691   0.000000 93.235.151.95 → 88.198.0.164 54 TCP [TCP Dup ACK 28688#2] 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0

we'll throw that data (from frame #28689) away, and finally request
a retransmission as we spotted the gap now.

Request a retransmission as soon as we know we'll need it, instead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index 5a7a607..ae5e7a1 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1876,7 +1876,7 @@ eintr:
 	}
 
 out:
-	if (keep != -1) {
+	if (keep != -1 || partial_send) {
 		/* We use an 8-bit approximation here: the associated risk is
 		 * that we skip a duplicate ACK on 8-bit sequence number
 		 * collision. Fast retransmit is a SHOULD in RFC 5681, 3.2.
-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away
  2025-10-02  0:06 ` [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away Stefano Brivio
@ 2025-10-02  3:00   ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-10-02  3:00 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2435 bytes --]

On Thu, Oct 02, 2025 at 02:06:46AM +0200, Stefano Brivio wrote:
> ...possibly with an updated window value. As we're discarding the
> remaining data, we'll need to receive it again. If we don't request
> a retransmission immediately, we'll see an apparent gap in the
> sequence, and request a retransmission on the next data batch or
> segment, but we're just wasting time like that. In packets:
> 
> 28686   0.000007 88.198.0.164 → 93.235.151.95 16118 TCP 55414 → 47080 [ACK] Seq=80501 Ack=141 Win=65536 Len=16064 [TCP segment of a reassembled PDU]
> 28687   0.000012 88.198.0.164 → 93.235.151.95 16118 TCP [TCP Window Full] 55414 → 47080 [ACK] Seq=96565 Ack=141 Win=65536 Len=16064 [TCP segment of a reassembled PDU]
> 
> on this second data segment, we have a short sendmsg(), and
> 
> 28688   0.000026 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0
> 
> consequently acknowledge it, without requesting a retransmission,
> 
> 28689   0.000006 88.198.0.164 → 93.235.151.95 8866 HTTP HTTP/1.1 200 ok  (text/css)
> 
> so the server proceeds sending a bit more, but
> 
> 28690   0.000016 93.235.151.95 → 88.198.0.164 54 TCP [TCP Dup ACK 28688#1] 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0
> 28691   0.000000 93.235.151.95 → 88.198.0.164 54 TCP [TCP Dup ACK 28688#2] 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0
> 
> we'll throw that data (from frame #28689) away, and finally request
> a retransmission as we spotted the gap now.
> 
> Request a retransmission as soon as we know we'll need it, instead.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 5a7a607..ae5e7a1 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1876,7 +1876,7 @@ eintr:
>  	}
>  
>  out:
> -	if (keep != -1) {
> +	if (keep != -1 || partial_send) {
>  		/* We use an 8-bit approximation here: the associated risk is
>  		 * that we skip a duplicate ACK on 8-bit sequence number
>  		 * collision. Fast retransmit is a SHOULD in RFC 5681, 3.2.
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-10-09 19:28 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-02  0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio
2025-10-02  0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio
2025-10-02  2:41   ` David Gibson
2025-10-02 11:58     ` Stefano Brivio
2025-10-03  3:19       ` David Gibson
2025-10-06 22:32         ` Stefano Brivio
2025-10-06 23:31           ` David Gibson
2025-10-07 22:42             ` Stefano Brivio
2025-10-08  0:42               ` David Gibson
2025-10-02  0:06 ` [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message Stefano Brivio
2025-10-02  2:44   ` David Gibson
2025-10-02  0:06 ` [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence Stefano Brivio
2025-10-02  2:52   ` David Gibson
2025-10-02  3:02     ` David Gibson
2025-10-02 11:51       ` Stefano Brivio
2025-10-03  3:43         ` David Gibson
2025-10-06 22:32           ` Stefano Brivio
2025-10-06 23:34             ` David Gibson
2025-10-07 22:42               ` Stefano Brivio
2025-10-08  0:41                 ` David Gibson
2025-10-08 10:01                   ` Stefano Brivio
2025-10-09  2:43                     ` David Gibson
2025-10-09 19:28                       ` Stefano Brivio
2025-10-02  0:06 ` [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away Stefano Brivio
2025-10-02  3:00   ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).