* [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues @ 2025-10-02 0:06 Stefano Brivio 2025-10-02 0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio ` (3 more replies) 0 siblings, 4 replies; 18+ messages in thread From: Stefano Brivio @ 2025-10-02 0:06 UTC (permalink / raw) To: passt-dev The most important fix here is 3/4, the other ones are either fixes for issues that weren't actually observed in practice, but I mistook them as the case fixed by 3/4, or for minor details. Stefano Brivio (4): tcp: Fix ACK sequence on FIN to tap tcp: Completely ignore data segment in CLOSE-WAIT state, log a message tcp: Don't consider FIN flags with mismatching sequence tcp: On partial send (incomplete sendmsg()), request a retransmission right away tcp.c | 12 +++++++++--- tcp_buf.c | 14 +++++++++++++- tcp_vu.c | 7 ++++++- 3 files changed, 28 insertions(+), 5 deletions(-) -- 2.43.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap 2025-10-02 0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio @ 2025-10-02 0:06 ` Stefano Brivio 2025-10-02 2:41 ` David Gibson 2025-10-02 0:06 ` [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message Stefano Brivio ` (2 subsequent siblings) 3 siblings, 1 reply; 18+ messages in thread From: Stefano Brivio @ 2025-10-02 0:06 UTC (permalink / raw) To: passt-dev If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and send a FIN segment to the guest / container acknowledging a sequence number that's behind what we received so far, we won't have any further trigger to send an updated ACK segment, as we are now switching the epoll socket monitoring to edge-triggered mode. To avoid this situation, in tcp_update_seqack_wnd(), we set the next acknowledgement sequence to the current observed sequence, regardless of what was acknowledged socket-side. However, we don't necessarily call tcp_update_seqack_wnd() before sending the FIN segment, which might potentially lead to a situation, not observed in practice, where we unnecessarily cause a retransmission at some point after our FIN segment. Avoid that by setting the ACK sequence to whatever we received from the container / guest, before sending a FIN segment and switching to EPOLLET. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- tcp_buf.c | 14 +++++++++++++- tcp_vu.c | 7 ++++++- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/tcp_buf.c b/tcp_buf.c index a493b5a..cc106bc 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) conn_flag(c, conn, STALLED); } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) { - int ret = tcp_buf_send_flag(c, conn, FIN | ACK); + int ret; + + /* On TAP_FIN_SENT, we won't get further data events + * from the socket, and this might be the last ACK + * segment we send to the tap, so update its sequence to + * include everything we received until now. + * + * See also the special handling on CONN_IS_CLOSING() in + * tcp_update_seqack_wnd(). + */ + conn->seq_ack_to_tap = conn->seq_from_tap; + + ret = tcp_buf_send_flag(c, conn, FIN | ACK); if (ret) { tcp_rst(c, conn); return ret; diff --git a/tcp_vu.c b/tcp_vu.c index ebd3a1e..3ec3538 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) conn_flag(c, conn, STALLED); } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) { - int ret = tcp_vu_send_flag(c, conn, FIN | ACK); + int ret; + + /* See tcp_buf_data_from_sock() */ + conn->seq_ack_to_tap = conn->seq_from_tap; + + ret = tcp_vu_send_flag(c, conn, FIN | ACK); if (ret) { tcp_rst(c, conn); return ret; -- 2.43.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap 2025-10-02 0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio @ 2025-10-02 2:41 ` David Gibson 2025-10-02 11:58 ` Stefano Brivio 0 siblings, 1 reply; 18+ messages in thread From: David Gibson @ 2025-10-02 2:41 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 3871 bytes --] On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote: > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and > send a FIN segment to the guest / container acknowledging a sequence > number that's behind what we received so far, we won't have any > further trigger to send an updated ACK segment, as we are now > switching the epoll socket monitoring to edge-triggered mode. > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next > acknowledgement sequence to the current observed sequence, regardless > of what was acknowledged socket-side. To double check my understanding: things should work if we always acknowledged everything we've received. Acknowledging only what the peer has acked is a refinement to give the guest a view that's closer to what it would be end-to-end with the peer (which might improve the operation of flow control). We can't use that refined mechanism when the socket is closing (amongst other cases), because while we can get the peer acked bytes from TCP_INFO, we can't get events when that changes, so we have no mechanism to provide updates to the guest at the right time. So we fall back to the simpler method. Is that correct? > However, we don't necessarily call tcp_update_seqack_wnd() before > sending the FIN segment, which might potentially lead to a situation, > not observed in practice, where we unnecessarily cause a > retransmission at some point after our FIN segment. > > Avoid that by setting the ACK sequence to whatever we received from > the container / guest, before sending a FIN segment and switching to > EPOLLET. > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Based on my understanding above, this looks correct to me, so, Reviewed-by: David Gibson <david@gibson.dropbear.id.au> My only concern is whether we could instead insert an extra call to tcp_update_seqack_wnd() to reduce duplicated logic. > --- > tcp_buf.c | 14 +++++++++++++- > tcp_vu.c | 7 ++++++- > 2 files changed, 19 insertions(+), 2 deletions(-) > > diff --git a/tcp_buf.c b/tcp_buf.c > index a493b5a..cc106bc 100644 > --- a/tcp_buf.c > +++ b/tcp_buf.c > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > conn_flag(c, conn, STALLED); > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > SOCK_FIN_RCVD) { > - int ret = tcp_buf_send_flag(c, conn, FIN | ACK); > + int ret; > + > + /* On TAP_FIN_SENT, we won't get further data events > + * from the socket, and this might be the last ACK > + * segment we send to the tap, so update its sequence to > + * include everything we received until now. > + * > + * See also the special handling on CONN_IS_CLOSING() in > + * tcp_update_seqack_wnd(). > + */ > + conn->seq_ack_to_tap = conn->seq_from_tap; > + > + ret = tcp_buf_send_flag(c, conn, FIN | ACK); > if (ret) { > tcp_rst(c, conn); > return ret; > diff --git a/tcp_vu.c b/tcp_vu.c > index ebd3a1e..3ec3538 100644 > --- a/tcp_vu.c > +++ b/tcp_vu.c > @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > conn_flag(c, conn, STALLED); > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > SOCK_FIN_RCVD) { > - int ret = tcp_vu_send_flag(c, conn, FIN | ACK); > + int ret; > + > + /* See tcp_buf_data_from_sock() */ > + conn->seq_ack_to_tap = conn->seq_from_tap; > + > + ret = tcp_vu_send_flag(c, conn, FIN | ACK); > if (ret) { > tcp_rst(c, conn); > return ret; > -- > 2.43.0 > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap 2025-10-02 2:41 ` David Gibson @ 2025-10-02 11:58 ` Stefano Brivio 2025-10-03 3:19 ` David Gibson 0 siblings, 1 reply; 18+ messages in thread From: Stefano Brivio @ 2025-10-02 11:58 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Thu, 2 Oct 2025 12:41:08 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote: > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and > > send a FIN segment to the guest / container acknowledging a sequence > > number that's behind what we received so far, we won't have any > > further trigger to send an updated ACK segment, as we are now > > switching the epoll socket monitoring to edge-triggered mode. > > > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next > > acknowledgement sequence to the current observed sequence, regardless > > of what was acknowledged socket-side. > > To double check my understanding: things should work if we always > acknowledged everything we've received. Acknowledging only what the > peer has acked is a refinement to give the guest a view that's closer > to what it would be end-to-end with the peer (which might improve the > operation of flow control). Right. > We can't use that refined mechanism when the socket is closing > (amongst other cases), because while we can get the peer acked bytes > from TCP_INFO, we can't get events when that changes, so we have no > mechanism to provide updates to the guest at the right time. So we > fall back to the simpler method. > > Is that correct? Also correct, yes. If you have a better idea to summarise this in the comment in tcp_buf_data_from_sock() let me know. Maybe I could mention EPOLLET explicitly there? > > However, we don't necessarily call tcp_update_seqack_wnd() before > > sending the FIN segment, which might potentially lead to a situation, > > not observed in practice, where we unnecessarily cause a > > retransmission at some point after our FIN segment. > > > > Avoid that by setting the ACK sequence to whatever we received from > > the container / guest, before sending a FIN segment and switching to > > EPOLLET. > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > Based on my understanding above, this looks correct to me, so, > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > > My only concern is whether we could instead insert an extra call to > tcp_update_seqack_wnd() to reduce duplicated logic. Hmm, maybe, but on the other hand we're closing the connection. Should we really spend time querying TCP_INFO to recalculate the window at this point? I wouldn't. > > --- > > tcp_buf.c | 14 +++++++++++++- > > tcp_vu.c | 7 ++++++- > > 2 files changed, 19 insertions(+), 2 deletions(-) > > > > diff --git a/tcp_buf.c b/tcp_buf.c > > index a493b5a..cc106bc 100644 > > --- a/tcp_buf.c > > +++ b/tcp_buf.c > > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > > conn_flag(c, conn, STALLED); > > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > > SOCK_FIN_RCVD) { > > - int ret = tcp_buf_send_flag(c, conn, FIN | ACK); > > + int ret; > > + > > + /* On TAP_FIN_SENT, we won't get further data events > > + * from the socket, and this might be the last ACK > > + * segment we send to the tap, so update its sequence to > > + * include everything we received until now. > > + * > > + * See also the special handling on CONN_IS_CLOSING() in > > + * tcp_update_seqack_wnd(). > > + */ > > + conn->seq_ack_to_tap = conn->seq_from_tap; > > + > > + ret = tcp_buf_send_flag(c, conn, FIN | ACK); > > if (ret) { > > tcp_rst(c, conn); > > return ret; > > diff --git a/tcp_vu.c b/tcp_vu.c > > index ebd3a1e..3ec3538 100644 > > --- a/tcp_vu.c > > +++ b/tcp_vu.c > > @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > > conn_flag(c, conn, STALLED); > > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > > SOCK_FIN_RCVD) { > > - int ret = tcp_vu_send_flag(c, conn, FIN | ACK); > > + int ret; > > + > > + /* See tcp_buf_data_from_sock() */ > > + conn->seq_ack_to_tap = conn->seq_from_tap; > > + > > + ret = tcp_vu_send_flag(c, conn, FIN | ACK); > > if (ret) { > > tcp_rst(c, conn); > > return ret; > > -- > > 2.43.0 > > -- Stefano ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap 2025-10-02 11:58 ` Stefano Brivio @ 2025-10-03 3:19 ` David Gibson 2025-10-06 22:32 ` Stefano Brivio 0 siblings, 1 reply; 18+ messages in thread From: David Gibson @ 2025-10-03 3:19 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 5560 bytes --] On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote: > On Thu, 2 Oct 2025 12:41:08 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote: > > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and > > > send a FIN segment to the guest / container acknowledging a sequence > > > number that's behind what we received so far, we won't have any > > > further trigger to send an updated ACK segment, as we are now > > > switching the epoll socket monitoring to edge-triggered mode. > > > > > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next > > > acknowledgement sequence to the current observed sequence, regardless > > > of what was acknowledged socket-side. > > > > To double check my understanding: things should work if we always > > acknowledged everything we've received. Acknowledging only what the > > peer has acked is a refinement to give the guest a view that's closer > > to what it would be end-to-end with the peer (which might improve the > > operation of flow control). > > Right. > > > We can't use that refined mechanism when the socket is closing > > (amongst other cases), because while we can get the peer acked bytes > > from TCP_INFO, we can't get events when that changes, so we have no > > mechanism to provide updates to the guest at the right time. So we > > fall back to the simpler method. > > > > Is that correct? > > Also correct, yes. If you have a better idea to summarise this in the > comment in tcp_buf_data_from_sock() let me know. Hm, I might. Or actually a way to reorganise the code that I think will be a bit clearer and probably allow a clearer comment too. > Maybe I could mention > EPOLLET explicitly there? I don't think EPOLLET is actually relevant. Even if we had level triggered events, a change in bytes_acked doesn't count as an event (AFAIK). So either some other event is on, in which case we'd effectively be busy polling bytes_acked, or it's not in which case we don't get updates, just like now. I principle we could implement some sort of timer based polling, but that sounds like way more trouble than it's worth. > > > However, we don't necessarily call tcp_update_seqack_wnd() before > > > sending the FIN segment, which might potentially lead to a situation, > > > not observed in practice, where we unnecessarily cause a > > > retransmission at some point after our FIN segment. > > > > > > Avoid that by setting the ACK sequence to whatever we received from > > > the container / guest, before sending a FIN segment and switching to > > > EPOLLET. > > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > Based on my understanding above, this looks correct to me, so, > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > > > > My only concern is whether we could instead insert an extra call to > > tcp_update_seqack_wnd() to reduce duplicated logic. > > Hmm, maybe, but on the other hand we're closing the connection. Should > we really spend time querying TCP_INFO to recalculate the window at > this point? I wouldn't. Good point. I mean tcp_update_seqack_wnd() could skip the TCP_INFO in that case, but that does look a bit fiddly. On the other hand, in favour of not duplicating logic... [snip] > > > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > > > conn_flag(c, conn, STALLED); > > > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > > > SOCK_FIN_RCVD) { > > > - int ret = tcp_buf_send_flag(c, conn, FIN | ACK); > > > + int ret; > > > + > > > + /* On TAP_FIN_SENT, we won't get further data events > > > + * from the socket, and this might be the last ACK > > > + * segment we send to the tap, so update its sequence to > > > + * include everything we received until now. > > > + * > > > + * See also the special handling on CONN_IS_CLOSING() in > > > + * tcp_update_seqack_wnd(). > > > + */ > > > + conn->seq_ack_to_tap = conn->seq_from_tap; ... the equivalent bits in tcp_update_seqack_wnd() have after them: if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap)) conn->seq_ack_to_tap = prev_ack_to_tap; Don't we need that here as well, in case the guest is retransmitting when we get the sock side FIN? > > > + > > > + ret = tcp_buf_send_flag(c, conn, FIN | ACK); > > > if (ret) { > > > tcp_rst(c, conn); > > > return ret; > > > diff --git a/tcp_vu.c b/tcp_vu.c > > > index ebd3a1e..3ec3538 100644 > > > --- a/tcp_vu.c > > > +++ b/tcp_vu.c > > > @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > > > conn_flag(c, conn, STALLED); > > > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > > > SOCK_FIN_RCVD) { > > > - int ret = tcp_vu_send_flag(c, conn, FIN | ACK); > > > + int ret; > > > + > > > + /* See tcp_buf_data_from_sock() */ > > > + conn->seq_ack_to_tap = conn->seq_from_tap; > > > + > > > + ret = tcp_vu_send_flag(c, conn, FIN | ACK); > > > if (ret) { > > > tcp_rst(c, conn); > > > return ret; > > > -- > > > 2.43.0 > > > > > -- > Stefano > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap 2025-10-03 3:19 ` David Gibson @ 2025-10-06 22:32 ` Stefano Brivio 2025-10-06 23:31 ` David Gibson 0 siblings, 1 reply; 18+ messages in thread From: Stefano Brivio @ 2025-10-06 22:32 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Fri, 3 Oct 2025 13:19:17 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote: > > On Thu, 2 Oct 2025 12:41:08 +1000 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote: > > > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and > > > > send a FIN segment to the guest / container acknowledging a sequence > > > > number that's behind what we received so far, we won't have any > > > > further trigger to send an updated ACK segment, as we are now > > > > switching the epoll socket monitoring to edge-triggered mode. > > > > > > > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next > > > > acknowledgement sequence to the current observed sequence, regardless > > > > of what was acknowledged socket-side. > > > > > > To double check my understanding: things should work if we always > > > acknowledged everything we've received. Acknowledging only what the > > > peer has acked is a refinement to give the guest a view that's closer > > > to what it would be end-to-end with the peer (which might improve the > > > operation of flow control). > > > > Right. > > > > > We can't use that refined mechanism when the socket is closing > > > (amongst other cases), because while we can get the peer acked bytes > > > from TCP_INFO, we can't get events when that changes, so we have no > > > mechanism to provide updates to the guest at the right time. So we > > > fall back to the simpler method. > > > > > > Is that correct? > > > > Also correct, yes. If you have a better idea to summarise this in the > > comment in tcp_buf_data_from_sock() let me know. > > Hm, I might. Or actually a way to reorganise the code that I think > will be a bit clearer and probably allow a clearer comment too. I would keep reworks for a later moment. Right now, it's already taking me long enough to find a moment to investigate these issues, write these fixes, and test them. > > Maybe I could mention > > EPOLLET explicitly there? > > I don't think EPOLLET is actually relevant. Even if we had level > triggered events, a change in bytes_acked doesn't count as an event > (AFAIK). It doesn't count, but with level-triggered events, we would be busy polling bytes_acked, as you noted. I was mentioning EPOLLET because it could be taken, intuitively, as a "stop listening for events" (almost) step. I'll leave that out then. > So either some other event is on, in which case we'd > effectively be busy polling bytes_acked, or it's not in which case we > don't get updates, just like now. > > I principle we could implement some sort of timer based polling, but > that sounds like way more trouble than it's worth. We already have something similar, based on ACK_INTERVAL and ACK_TO_TAP_DUE, and it shouldn't be overly complicated to extend that to a new FIN_TO_TAP_DUE flag. But indeed beyond the scope of this series. > > > > However, we don't necessarily call tcp_update_seqack_wnd() before > > > > sending the FIN segment, which might potentially lead to a situation, > > > > not observed in practice, where we unnecessarily cause a > > > > retransmission at some point after our FIN segment. > > > > > > > > Avoid that by setting the ACK sequence to whatever we received from > > > > the container / guest, before sending a FIN segment and switching to > > > > EPOLLET. > > > > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > > > Based on my understanding above, this looks correct to me, so, > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > > > > > > My only concern is whether we could instead insert an extra call to > > > tcp_update_seqack_wnd() to reduce duplicated logic. > > > > Hmm, maybe, but on the other hand we're closing the connection. Should > > we really spend time querying TCP_INFO to recalculate the window at > > this point? I wouldn't. > > Good point. I mean tcp_update_seqack_wnd() could skip the TCP_INFO in > that case, but that does look a bit fiddly. > > On the other hand, in favour of not duplicating logic... > > [snip] > > > > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > > > > conn_flag(c, conn, STALLED); > > > > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > > > > SOCK_FIN_RCVD) { > > > > - int ret = tcp_buf_send_flag(c, conn, FIN | ACK); > > > > + int ret; > > > > + > > > > + /* On TAP_FIN_SENT, we won't get further data events > > > > + * from the socket, and this might be the last ACK > > > > + * segment we send to the tap, so update its sequence to > > > > + * include everything we received until now. > > > > + * > > > > + * See also the special handling on CONN_IS_CLOSING() in > > > > + * tcp_update_seqack_wnd(). > > > > + */ > > > > + conn->seq_ack_to_tap = conn->seq_from_tap; > > ... the equivalent bits in tcp_update_seqack_wnd() have after them: > if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap)) > conn->seq_ack_to_tap = prev_ack_to_tap; > > Don't we need that here as well, in case the guest is retransmitting > when we get the sock side FIN? Not really, because we don't rewind conn->seq_from_tap, so we don't risk jumping back here. In tcp_update_seqack_wnd(), we might jump back (that should be double checked eventually, I'm not sure it's still the case) if we happened to acknowledge more than acknowledged socket-side while handling some particular condition, and then we switch back to acknowledging only bytes_acked. It should happen if the destination is/was a low RTT one, but we run out of slots in low_rtt_dst in favour of other entries. I don't remember any other case. > > > > + > > > > + ret = tcp_buf_send_flag(c, conn, FIN | ACK); > > > > if (ret) { > > > > tcp_rst(c, conn); > > > > return ret; > > > > diff --git a/tcp_vu.c b/tcp_vu.c > > > > index ebd3a1e..3ec3538 100644 > > > > --- a/tcp_vu.c > > > > +++ b/tcp_vu.c > > > > @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > > > > conn_flag(c, conn, STALLED); > > > > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > > > > SOCK_FIN_RCVD) { > > > > - int ret = tcp_vu_send_flag(c, conn, FIN | ACK); > > > > + int ret; > > > > + > > > > + /* See tcp_buf_data_from_sock() */ > > > > + conn->seq_ack_to_tap = conn->seq_from_tap; > > > > + > > > > + ret = tcp_vu_send_flag(c, conn, FIN | ACK); > > > > if (ret) { > > > > tcp_rst(c, conn); > > > > return ret; > > > > -- > > > > 2.43.0 > > > > -- Stefano ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap 2025-10-06 22:32 ` Stefano Brivio @ 2025-10-06 23:31 ` David Gibson 0 siblings, 0 replies; 18+ messages in thread From: David Gibson @ 2025-10-06 23:31 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 7017 bytes --] On Tue, Oct 07, 2025 at 12:32:19AM +0200, Stefano Brivio wrote: > On Fri, 3 Oct 2025 13:19:17 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote: > > > On Thu, 2 Oct 2025 12:41:08 +1000 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote: > > > > > If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and > > > > > send a FIN segment to the guest / container acknowledging a sequence > > > > > number that's behind what we received so far, we won't have any > > > > > further trigger to send an updated ACK segment, as we are now > > > > > switching the epoll socket monitoring to edge-triggered mode. > > > > > > > > > > To avoid this situation, in tcp_update_seqack_wnd(), we set the next > > > > > acknowledgement sequence to the current observed sequence, regardless > > > > > of what was acknowledged socket-side. > > > > > > > > To double check my understanding: things should work if we always > > > > acknowledged everything we've received. Acknowledging only what the > > > > peer has acked is a refinement to give the guest a view that's closer > > > > to what it would be end-to-end with the peer (which might improve the > > > > operation of flow control). > > > > > > Right. > > > > > > > We can't use that refined mechanism when the socket is closing > > > > (amongst other cases), because while we can get the peer acked bytes > > > > from TCP_INFO, we can't get events when that changes, so we have no > > > > mechanism to provide updates to the guest at the right time. So we > > > > fall back to the simpler method. > > > > > > > > Is that correct? > > > > > > Also correct, yes. If you have a better idea to summarise this in the > > > comment in tcp_buf_data_from_sock() let me know. > > > > Hm, I might. Or actually a way to reorganise the code that I think > > will be a bit clearer and probably allow a clearer comment too. > > I would keep reworks for a later moment. Right now, it's already taking > me long enough to find a moment to investigate these issues, write these > fixes, and test them. I mean... the change I'm proposing reduces lines of code (excepting the big new comment), makes it easier to reason about and is localised to the immediately surrounding code. But whatever, I don't particularly care about the order we do things. > > > Maybe I could mention > > > EPOLLET explicitly there? > > > > I don't think EPOLLET is actually relevant. Even if we had level > > triggered events, a change in bytes_acked doesn't count as an event > > (AFAIK). > > It doesn't count, but with level-triggered events, we would be busy > polling bytes_acked, as you noted. I was mentioning EPOLLET because it > could be taken, intuitively, as a "stop listening for events" (almost) > step. I'll leave that out then. I mean, if busy polling were acceptable we could accomplish that easily enough by doing it in tcp_defer_handler() regardless of EPOLLET. > > So either some other event is on, in which case we'd > > effectively be busy polling bytes_acked, or it's not in which case we > > don't get updates, just like now. > > > > I principle we could implement some sort of timer based polling, but > > that sounds like way more trouble than it's worth. > > We already have something similar, based on ACK_INTERVAL and > ACK_TO_TAP_DUE, and it shouldn't be overly complicated to extend that > to a new FIN_TO_TAP_DUE flag. But indeed beyond the scope of this > series. Certainly. > > > > > However, we don't necessarily call tcp_update_seqack_wnd() before > > > > > sending the FIN segment, which might potentially lead to a situation, > > > > > not observed in practice, where we unnecessarily cause a > > > > > retransmission at some point after our FIN segment. > > > > > > > > > > Avoid that by setting the ACK sequence to whatever we received from > > > > > the container / guest, before sending a FIN segment and switching to > > > > > EPOLLET. > > > > > > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > > > > > > Based on my understanding above, this looks correct to me, so, > > > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > > > > > > > > My only concern is whether we could instead insert an extra call to > > > > tcp_update_seqack_wnd() to reduce duplicated logic. > > > > > > Hmm, maybe, but on the other hand we're closing the connection. Should > > > we really spend time querying TCP_INFO to recalculate the window at > > > this point? I wouldn't. > > > > Good point. I mean tcp_update_seqack_wnd() could skip the TCP_INFO in > > that case, but that does look a bit fiddly. > > > > On the other hand, in favour of not duplicating logic... > > > > [snip] > > > > > @@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) > > > > > conn_flag(c, conn, STALLED); > > > > > } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == > > > > > SOCK_FIN_RCVD) { > > > > > - int ret = tcp_buf_send_flag(c, conn, FIN | ACK); > > > > > + int ret; > > > > > + > > > > > + /* On TAP_FIN_SENT, we won't get further data events > > > > > + * from the socket, and this might be the last ACK > > > > > + * segment we send to the tap, so update its sequence to > > > > > + * include everything we received until now. > > > > > + * > > > > > + * See also the special handling on CONN_IS_CLOSING() in > > > > > + * tcp_update_seqack_wnd(). > > > > > + */ > > > > > + conn->seq_ack_to_tap = conn->seq_from_tap; > > > > ... the equivalent bits in tcp_update_seqack_wnd() have after them: > > if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap)) > > conn->seq_ack_to_tap = prev_ack_to_tap; > > > > Don't we need that here as well, in case the guest is retransmitting > > when we get the sock side FIN? > > Not really, because we don't rewind conn->seq_from_tap, so we don't > risk jumping back here. Ah, because this one's on the sock->tap data path, whereas the other calls are on the tap->sock data path. Good point. > In tcp_update_seqack_wnd(), we might jump back (that should be double > checked eventually, I'm not sure it's still the case) if we happened to > acknowledge more than acknowledged socket-side while handling some > particular condition, and then we switch back to acknowledging only > bytes_acked. > > It should happen if the destination is/was a low RTT one, but we run > out of slots in low_rtt_dst in favour of other entries. I don't > remember any other case. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message 2025-10-02 0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio 2025-10-02 0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio @ 2025-10-02 0:06 ` Stefano Brivio 2025-10-02 2:44 ` David Gibson 2025-10-02 0:06 ` [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence Stefano Brivio 2025-10-02 0:06 ` [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away Stefano Brivio 3 siblings, 1 reply; 18+ messages in thread From: Stefano Brivio @ 2025-10-02 0:06 UTC (permalink / raw) To: passt-dev According to RFC 9293 we should ignore data (note: not data segments) in CLOSE-WAIT state (indicated by TAP_FIN_RCVD), see 3.10.7.4 "Other states": [...] Seventh, process the segment text: [...] CLOSE-WAIT STATE This should not occur since a FIN has been received from the remote side. Ignore the segment text. and we almost do that, except that we would look at the data length to decide whether it's a request for fast re-transmission, so fix that, and while at it, log a message, so that cases such as the following one are more apparent in debug logs: 28692 0.009758 88.198.0.164 → 93.235.151.95 54 TCP 55414 → 47080 [FIN, ACK] Seq=121441 Ack=141 Win=65536 Len=0 we should ignore this FIN flag, because we didn't accept data up to this sequence (see next segment), but we don't do it, so, here: 28693 0.000036 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [ACK] Seq=141 Ack=90722 Win=32128 Len=0 28694 0.034597 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [FIN, ACK] Seq=141 Ack=90722 Win=121216 Len=0 28695 0.000019 88.198.0.164 → 93.235.151.95 54 TCP 55414 → 47080 [ACK] Seq=121442 Ack=142 Win=65536 Len=0 28696 0.162968 88.198.0.164 → 93.235.151.95 30773 TCP [TCP Retransmission] 55414 → 47080 [FIN, PSH, ACK] Seq=90722 Ack=142 Win=65536 Len=30719 [TCP segment of a reassembled PDU] we are erroneously in CLOSE-WAIT (TAP_FIN_RCVD) state, and this segment would look pretty strange there. This specific case is fixed by the next patch, so it should never happen again. Link: https://archives.passt.top/passt-dev/20250910115726.432bbb8d@elisabeth/ Link: https://bugs.passt.top/show_bug.cgi?id=126 Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- tcp.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tcp.c b/tcp.c index 48b1ef2..3f7dc82 100644 --- a/tcp.c +++ b/tcp.c @@ -2130,9 +2130,15 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, /* Established connections not accepting data from tap */ if (conn->events & TAP_FIN_RCVD) { + size_t dlen; bool retr; - retr = th->ack && !tcp_packet_data_len(th, l4len) && !th->fin && + if ((dlen = tcp_packet_data_len(th, l4len))) { + flow_dbg(conn, "data segment in CLOSE-WAIT (%zu B)", + dlen); + } + + retr = th->ack && !th->fin && ntohl(th->ack_seq) == conn->seq_ack_from_tap && ntohs(th->window) == conn->wnd_from_tap; -- 2.43.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message 2025-10-02 0:06 ` [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message Stefano Brivio @ 2025-10-02 2:44 ` David Gibson 0 siblings, 0 replies; 18+ messages in thread From: David Gibson @ 2025-10-02 2:44 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 3030 bytes --] On Thu, Oct 02, 2025 at 02:06:44AM +0200, Stefano Brivio wrote: > According to RFC 9293 we should ignore data (note: not data segments) > in CLOSE-WAIT state (indicated by TAP_FIN_RCVD), see 3.10.7.4 > "Other states": > > [...] > > Seventh, process the segment text: > > [...] > > CLOSE-WAIT STATE > > This should not occur since a FIN has been received from the remote > side. Ignore the segment text. > > and we almost do that, except that we would look at the data length > to decide whether it's a request for fast re-transmission, so fix > that, and while at it, log a message, so that cases such as the > following one are more apparent in debug logs: > > 28692 0.009758 88.198.0.164 → 93.235.151.95 54 TCP 55414 → 47080 [FIN, ACK] Seq=121441 Ack=141 Win=65536 Len=0 > > we should ignore this FIN flag, because we didn't accept data up > to this sequence (see next segment), but we don't do it, so, here: > > 28693 0.000036 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [ACK] Seq=141 Ack=90722 Win=32128 Len=0 > 28694 0.034597 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [FIN, ACK] Seq=141 Ack=90722 Win=121216 Len=0 > 28695 0.000019 88.198.0.164 → 93.235.151.95 54 TCP 55414 → 47080 [ACK] Seq=121442 Ack=142 Win=65536 Len=0 > 28696 0.162968 88.198.0.164 → 93.235.151.95 30773 TCP [TCP Retransmission] 55414 → 47080 [FIN, PSH, ACK] Seq=90722 Ack=142 Win=65536 Len=30719 [TCP segment of a reassembled PDU] > > we are erroneously in CLOSE-WAIT (TAP_FIN_RCVD) state, and this > segment would look pretty strange there. > > This specific case is fixed by the next patch, so it should never > happen again. > > Link: https://archives.passt.top/passt-dev/20250910115726.432bbb8d@elisabeth/ > Link: https://bugs.passt.top/show_bug.cgi?id=126 > Suggested-by: David Gibson <david@gibson.dropbear.id.au> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > --- > tcp.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/tcp.c b/tcp.c > index 48b1ef2..3f7dc82 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -2130,9 +2130,15 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, > > /* Established connections not accepting data from tap */ > if (conn->events & TAP_FIN_RCVD) { > + size_t dlen; > bool retr; > > - retr = th->ack && !tcp_packet_data_len(th, l4len) && !th->fin && > + if ((dlen = tcp_packet_data_len(th, l4len))) { > + flow_dbg(conn, "data segment in CLOSE-WAIT (%zu B)", > + dlen); > + } > + > + retr = th->ack && !th->fin && > ntohl(th->ack_seq) == conn->seq_ack_from_tap && > ntohs(th->window) == conn->wnd_from_tap; > > -- > 2.43.0 > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence 2025-10-02 0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio 2025-10-02 0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio 2025-10-02 0:06 ` [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message Stefano Brivio @ 2025-10-02 0:06 ` Stefano Brivio 2025-10-02 2:52 ` David Gibson 2025-10-02 0:06 ` [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away Stefano Brivio 3 siblings, 1 reply; 18+ messages in thread From: Stefano Brivio @ 2025-10-02 0:06 UTC (permalink / raw) To: passt-dev If a guest or container sends us a FIN segment but its sequence number doesn't match the highest sequence of data we *accepted* (not necessarily the highest sequence we received), that is, conn->seq_from_tap, plus any data we're accepting in the current batch, we should discard the flag (not necessarily the segment), because there's still data we need to receive (again) before the end of the stream. If we consider those FIN flags as such, we'll end up in the situation described below. Here, 192.168.10.102 is a HTTP server in a Podman container, and 192.168.10.44 is a client fetching approximately 121 KB of data from it: 82 2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0 the server is done sending 83 2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0 pasta (client) acknowledges a previous sequence, because of a short sendmsg() 84 2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0 pasta (client) sends FIN, ACK as the client has no more data to send (a single GET request), while still acknowledging a previous sequence, because the retransmission didn't happen yet 85 2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0 the server acknowledges the FIN, ACK 86 2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU] and finally a retransmission comes, but as we wrongly switched to the CLOSE-WAIT state, 87 2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0 we consider frame #86 as an acknowledgement for the FIN segment we sent, and close the connection, while we still had to re-receive (and finally send) the missing data segment, instead. Link: https://github.com/containers/podman/issues/27179 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tcp.c b/tcp.c index 3f7dc82..5a7a607 100644 --- a/tcp.c +++ b/tcp.c @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, } } - if (th->fin) + if (th->fin && seq == seq_from_tap) fin = 1; if (!len) -- 2.43.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence 2025-10-02 0:06 ` [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence Stefano Brivio @ 2025-10-02 2:52 ` David Gibson 2025-10-02 3:02 ` David Gibson 0 siblings, 1 reply; 18+ messages in thread From: David Gibson @ 2025-10-02 2:52 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 3231 bytes --] On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote: > If a guest or container sends us a FIN segment but its sequence number > doesn't match the highest sequence of data we *accepted* (not > necessarily the highest sequence we received), that is, > conn->seq_from_tap, plus any data we're accepting in the current > batch, we should discard the flag (not necessarily the segment), > because there's still data we need to receive (again) before the end > of the stream. > > If we consider those FIN flags as such, we'll end up in the > situation described below. > > Here, 192.168.10.102 is a HTTP server in a Podman container, and > 192.168.10.44 is a client fetching approximately 121 KB of data from > it: > > 82 2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0 > > the server is done sending > > 83 2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0 > > pasta (client) acknowledges a previous sequence, because of > a short sendmsg() > > 84 2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0 > > pasta (client) sends FIN, ACK as the client has no more data to > send (a single GET request), while still acknowledging a previous > sequence, because the retransmission didn't happen yet > > 85 2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0 > > the server acknowledges the FIN, ACK > > 86 2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU] > > and finally a retransmission comes, but as we wrongly switched to > the CLOSE-WAIT state, > > 87 2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0 > > we consider frame #86 as an acknowledgement for the FIN segment we > sent, and close the connection, while we still had to re-receive > (and finally send) the missing data segment, instead. > > Link: https://github.com/containers/podman/issues/27179 > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > --- > tcp.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tcp.c b/tcp.c > index 3f7dc82..5a7a607 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > } > } > > - if (th->fin) > + if (th->fin && seq == seq_from_tap) > fin = 1; Can a FIN segment also contain data? My quick googling suggests yes. If so, doesn't this logic need to go after we process the data processing, so that seq_from_tap points to the end of the packet's data, rather than the beginning? (And the handling of zero-length packets would also need revision to match). > > if (!len) > -- > 2.43.0 > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence 2025-10-02 2:52 ` David Gibson @ 2025-10-02 3:02 ` David Gibson 2025-10-02 11:51 ` Stefano Brivio 0 siblings, 1 reply; 18+ messages in thread From: David Gibson @ 2025-10-02 3:02 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 3657 bytes --] On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote: > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote: > > If a guest or container sends us a FIN segment but its sequence number > > doesn't match the highest sequence of data we *accepted* (not > > necessarily the highest sequence we received), that is, > > conn->seq_from_tap, plus any data we're accepting in the current > > batch, we should discard the flag (not necessarily the segment), > > because there's still data we need to receive (again) before the end > > of the stream. > > > > If we consider those FIN flags as such, we'll end up in the > > situation described below. > > > > Here, 192.168.10.102 is a HTTP server in a Podman container, and > > 192.168.10.44 is a client fetching approximately 121 KB of data from > > it: > > > > 82 2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0 > > > > the server is done sending > > > > 83 2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0 > > > > pasta (client) acknowledges a previous sequence, because of > > a short sendmsg() > > > > 84 2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0 > > > > pasta (client) sends FIN, ACK as the client has no more data to > > send (a single GET request), while still acknowledging a previous > > sequence, because the retransmission didn't happen yet > > > > 85 2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0 > > > > the server acknowledges the FIN, ACK > > > > 86 2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU] > > > > and finally a retransmission comes, but as we wrongly switched to > > the CLOSE-WAIT state, > > > > 87 2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0 > > > > we consider frame #86 as an acknowledgement for the FIN segment we > > sent, and close the connection, while we still had to re-receive > > (and finally send) the missing data segment, instead. > > > > Link: https://github.com/containers/podman/issues/27179 > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > --- > > tcp.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/tcp.c b/tcp.c > > index 3f7dc82..5a7a607 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > } > > } > > > > - if (th->fin) > > + if (th->fin && seq == seq_from_tap) > > fin = 1; > > Can a FIN segment also contain data? My quick googling suggests yes. > If so, doesn't this logic need to go after we process the data > processing, so that seq_from_tap points to the end of the packet's > data, rather than the beginning? (And the handling of zero-length > packets would also need revision to match). Following on from that, it seems to me like it would make sense for FIN segments to also participate in the 'keep' mechanism. It should work eventually, but I expect it would be smoother in the case that we get a final burst of packets in a stream out of order. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence 2025-10-02 3:02 ` David Gibson @ 2025-10-02 11:51 ` Stefano Brivio 2025-10-03 3:43 ` David Gibson 0 siblings, 1 reply; 18+ messages in thread From: Stefano Brivio @ 2025-10-02 11:51 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Thu, 2 Oct 2025 13:02:09 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote: > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote: > > > If a guest or container sends us a FIN segment but its sequence number > > > doesn't match the highest sequence of data we *accepted* (not > > > necessarily the highest sequence we received), that is, > > > conn->seq_from_tap, plus any data we're accepting in the current > > > batch, we should discard the flag (not necessarily the segment), > > > because there's still data we need to receive (again) before the end > > > of the stream. > > > > > > If we consider those FIN flags as such, we'll end up in the > > > situation described below. > > > > > > Here, 192.168.10.102 is a HTTP server in a Podman container, and > > > 192.168.10.44 is a client fetching approximately 121 KB of data from > > > it: > > > > > > 82 2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0 > > > > > > the server is done sending > > > > > > 83 2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0 > > > > > > pasta (client) acknowledges a previous sequence, because of > > > a short sendmsg() > > > > > > 84 2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0 > > > > > > pasta (client) sends FIN, ACK as the client has no more data to > > > send (a single GET request), while still acknowledging a previous > > > sequence, because the retransmission didn't happen yet > > > > > > 85 2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0 > > > > > > the server acknowledges the FIN, ACK > > > > > > 86 2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU] > > > > > > and finally a retransmission comes, but as we wrongly switched to > > > the CLOSE-WAIT state, > > > > > > 87 2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0 > > > > > > we consider frame #86 as an acknowledgement for the FIN segment we > > > sent, and close the connection, while we still had to re-receive > > > (and finally send) the missing data segment, instead. > > > > > > Link: https://github.com/containers/podman/issues/27179 > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> > > > --- > > > tcp.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/tcp.c b/tcp.c > > > index 3f7dc82..5a7a607 100644 > > > --- a/tcp.c > > > +++ b/tcp.c > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > } > > > } > > > > > > - if (th->fin) > > > + if (th->fin && seq == seq_from_tap) > > > fin = 1; > > > > Can a FIN segment also contain data? My quick googling suggests yes. Yes, absolutely, my slow wiresharking over the years also confirms, and it's so often the case that (I think) this issue doesn't happen so frequently as it only occurs if we have a FIN segment without data. If we have a data segment, with FIN set, that we can't fully transmit, we already set 'partial_send' and won't set TAP_FIN_RCVD as a result. Another case where we want to ignore a FIN segment with data is if we have a gap before it, but in that case we'll eventually set 'keep' and return early. > > If so, doesn't this logic need to go after we process the data > > processing, so that seq_from_tap points to the end of the packet's > > data, rather than the beginning? (And the handling of zero-length > > packets would also need revision to match). This made sense to me for a moment but now I'm struggling to understand or remember why. What I want to check here is that a FIN segment without data (I should have specified in the commit message) is acceptable because its sequence is as expected. But going back to FIN segments with data: why should we sum the length to seq_from_tap before comparing the sequence? I don't understand what additional check you want to introduce, or what case you want to cover. > Following on from that, it seems to me like it would make sense for > FIN segments to also participate in the 'keep' mechanism. It should > work eventually, but I expect it would be smoother in the case that we > get a final burst of packets in a stream out of order. FIN segments with data already go through that dance. Without data, I guess you're right, we might have in the same batch (not that I've ever seen it happening in practice) a FIN segment without data that we process first (and now discard because of the sequence number), and some data before that we process later, so we shouldn't throw away the FIN segment because of that. We should, conceptually, reorder it as well. It probably makes things more complicated for a reason that's not so critical (ignoring a FIN is fine, we'll get another one), and I wanted to have the simplest possible fix here. Let me see if I can make this entirely correct without a substantially bigger change, I haven't really tried. -- Stefano ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence 2025-10-02 11:51 ` Stefano Brivio @ 2025-10-03 3:43 ` David Gibson 2025-10-06 22:32 ` Stefano Brivio 0 siblings, 1 reply; 18+ messages in thread From: David Gibson @ 2025-10-03 3:43 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 4869 bytes --] On Thu, Oct 02, 2025 at 01:51:04PM +0200, Stefano Brivio wrote: > On Thu, 2 Oct 2025 13:02:09 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote: > > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote: [snip] > > > > diff --git a/tcp.c b/tcp.c > > > > index 3f7dc82..5a7a607 100644 > > > > --- a/tcp.c > > > > +++ b/tcp.c > > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > } > > > > } > > > > > > > > - if (th->fin) > > > > + if (th->fin && seq == seq_from_tap) > > > > fin = 1; > > > > > > Can a FIN segment also contain data? My quick googling suggests yes. > > Yes, absolutely, my slow wiresharking over the years also confirms, and > it's so often the case that (I think) this issue doesn't happen so > frequently as it only occurs if we have a FIN segment without data. Makes sense. > If we have a data segment, with FIN set, that we can't fully transmit, > we already set 'partial_send' and won't set TAP_FIN_RCVD as a result. > > Another case where we want to ignore a FIN segment with data is if we > have a gap before it, but in that case we'll eventually set 'keep' and > return early. Ah, right. I'd noticed we set fin = 1 in that case, but forgotten about the exit before setting TAP_FIN_RCVD if keep is set. > > > If so, doesn't this logic need to go after we process the data > > > processing, so that seq_from_tap points to the end of the packet's > > > data, rather than the beginning? (And the handling of zero-length > > > packets would also need revision to match). > > This made sense to me for a moment but now I'm struggling to understand > or remember why. What I want to check here is that a FIN segment > without data (I should have specified in the commit message) is > acceptable because its sequence is as expected. Right. This is correct for zero-data FIN segments, but I think as a side-effect you've made it ignore certain FIN segments _with_ data. It will work in the common case where the data exactly follows on from what we already have. But in the case where the segment has some data we already have and some new data, the fin = 1 won't trip because seq != seq_from_tap. There isn't another place that will catch it instead, AFAICT. I guess it will be fine in the end, because with all the data acked, the guest should retransmit the FIN with no data. > But going back to FIN segments with data: why should we sum the length > to seq_from_tap before comparing the sequence? I don't understand what > additional check you want to introduce, or what case you want to cover. I was thinking about the case above, but I didn't explain it very well. > > Following on from that, it seems to me like it would make sense for > > FIN segments to also participate in the 'keep' mechanism. It should > > work eventually, but I expect it would be smoother in the case that we > > get a final burst of packets in a stream out of order. > > FIN segments with data already go through that dance. > > Without data, I guess you're right, we might have in the same batch > (not that I've ever seen it happening in practice) a FIN segment > without data that we process first (and now discard because of the > sequence number), and some data before that we process later, so we > shouldn't throw away the FIN segment because of that. We should, > conceptually, reorder it as well. > > It probably makes things more complicated for a reason that's not so > critical (ignoring a FIN is fine, we'll get another one), and I wanted > to have the simplest possible fix here. > > Let me see if I can make this entirely correct without a substantially > bigger change, I haven't really tried. How about this: diff --git a/tcp.c b/tcp.c index 7da41797..42e576b4 100644 --- a/tcp.c +++ b/tcp.c @@ -1774,10 +1774,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, } } - if (th->fin) - fin = 1; - - if (!len) + if (!len && !th->fin) continue; seq_offset = seq_from_tap - seq; @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, break; seq_from_tap += size; iov_i += count; + if (th->fin) + fin = 1; if (keep == i) keep = -1; We'd need to double check that the "accept data segment" path is safe with len == 0, of course. But I think that will treat dataless and with-data FINs the same way, and let them use the keep mechanism. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence 2025-10-03 3:43 ` David Gibson @ 2025-10-06 22:32 ` Stefano Brivio 2025-10-06 23:34 ` David Gibson 0 siblings, 1 reply; 18+ messages in thread From: Stefano Brivio @ 2025-10-06 22:32 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Fri, 3 Oct 2025 13:43:32 +1000 David Gibson <david@gibson.dropbear.id.au> wrote: > On Thu, Oct 02, 2025 at 01:51:04PM +0200, Stefano Brivio wrote: > > On Thu, 2 Oct 2025 13:02:09 +1000 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote: > > > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote: > [snip] > > > > > diff --git a/tcp.c b/tcp.c > > > > > index 3f7dc82..5a7a607 100644 > > > > > --- a/tcp.c > > > > > +++ b/tcp.c > > > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > > } > > > > > } > > > > > > > > > > - if (th->fin) > > > > > + if (th->fin && seq == seq_from_tap) > > > > > fin = 1; > > > > > > > > Can a FIN segment also contain data? My quick googling suggests yes. > > > > Yes, absolutely, my slow wiresharking over the years also confirms, and > > it's so often the case that (I think) this issue doesn't happen so > > frequently as it only occurs if we have a FIN segment without data. > > Makes sense. > > > If we have a data segment, with FIN set, that we can't fully transmit, > > we already set 'partial_send' and won't set TAP_FIN_RCVD as a result. > > > > Another case where we want to ignore a FIN segment with data is if we > > have a gap before it, but in that case we'll eventually set 'keep' and > > return early. > > Ah, right. I'd noticed we set fin = 1 in that case, but forgotten > about the exit before setting TAP_FIN_RCVD if keep is set. > > > > > If so, doesn't this logic need to go after we process the data > > > > processing, so that seq_from_tap points to the end of the packet's > > > > data, rather than the beginning? (And the handling of zero-length > > > > packets would also need revision to match). > > > > This made sense to me for a moment but now I'm struggling to understand > > or remember why. What I want to check here is that a FIN segment > > without data (I should have specified in the commit message) is > > acceptable because its sequence is as expected. > > Right. This is correct for zero-data FIN segments, but I think as a > side-effect you've made it ignore certain FIN segments _with_ data. > It will work in the common case where the data exactly follows on from > what we already have. But in the case where the segment has some data > we already have and some new data, the fin = 1 won't trip because seq > != seq_from_tap. There isn't another place that will catch it > instead, AFAICT. > > I guess it will be fine in the end, because with all the data acked, > the guest should retransmit the FIN with no data. > > > But going back to FIN segments with data: why should we sum the length > > to seq_from_tap before comparing the sequence? I don't understand what > > additional check you want to introduce, or what case you want to cover. > > I was thinking about the case above, but I didn't explain it very > well. > > > > Following on from that, it seems to me like it would make sense for > > > FIN segments to also participate in the 'keep' mechanism. It should > > > work eventually, but I expect it would be smoother in the case that we > > > get a final burst of packets in a stream out of order. > > > > FIN segments with data already go through that dance. > > > > Without data, I guess you're right, we might have in the same batch > > (not that I've ever seen it happening in practice) a FIN segment > > without data that we process first (and now discard because of the > > sequence number), and some data before that we process later, so we > > shouldn't throw away the FIN segment because of that. We should, > > conceptually, reorder it as well. > > > > It probably makes things more complicated for a reason that's not so > > critical (ignoring a FIN is fine, we'll get another one), and I wanted > > to have the simplest possible fix here. > > > > Let me see if I can make this entirely correct without a substantially > > bigger change, I haven't really tried. > > How about this: > > diff --git a/tcp.c b/tcp.c > index 7da41797..42e576b4 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1774,10 +1774,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > } > } > > - if (th->fin) > - fin = 1; > - > - if (!len) > + if (!len && !th->fin) > continue; > > seq_offset = seq_from_tap - seq; > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > break; > seq_from_tap += size; > iov_i += count; > + if (th->fin) > + fin = 1; > > if (keep == i) > keep = -1; > > > We'd need to double check that the "accept data segment" path is safe > with len == 0, of course. For sure it's not before d2c33f45f7be ("tcp: Convert tcp_data_from_tap() to use iov_tail"), because we might add zero-length segments to the tcp_iov array, and that would make backporting an otherwise simple and critical fix to slightly older versions rather complicated. After that commit, I'm not sure about the behaviour of iov_tail_clone(). I think it will return 0, but it should be tested (that assumption, itself, but also that the fix still works). > But I think that will treat dataless and > with-data FINs the same way, and let them use the keep mechanism. Given that the only advantage of doing this would be to handle a rare (I guess) corner case, that is, an out-of-sequence FIN segment with data, which is not critical anyway because the FIN will be retransmitted, I would rather keep this (critical) fix as it is. I would suggest filing a separate ticket or anyway sending a separate patch if you want to fix that other case. -- Stefano ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence 2025-10-06 22:32 ` Stefano Brivio @ 2025-10-06 23:34 ` David Gibson 0 siblings, 0 replies; 18+ messages in thread From: David Gibson @ 2025-10-06 23:34 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 6763 bytes --] On Tue, Oct 07, 2025 at 12:32:54AM +0200, Stefano Brivio wrote: > On Fri, 3 Oct 2025 13:43:32 +1000 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Thu, Oct 02, 2025 at 01:51:04PM +0200, Stefano Brivio wrote: > > > On Thu, 2 Oct 2025 13:02:09 +1000 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote: > > > > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote: > > [snip] > > > > > > diff --git a/tcp.c b/tcp.c > > > > > > index 3f7dc82..5a7a607 100644 > > > > > > --- a/tcp.c > > > > > > +++ b/tcp.c > > > > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > > > } > > > > > > } > > > > > > > > > > > > - if (th->fin) > > > > > > + if (th->fin && seq == seq_from_tap) > > > > > > fin = 1; > > > > > > > > > > Can a FIN segment also contain data? My quick googling suggests yes. > > > > > > Yes, absolutely, my slow wiresharking over the years also confirms, and > > > it's so often the case that (I think) this issue doesn't happen so > > > frequently as it only occurs if we have a FIN segment without data. > > > > Makes sense. > > > > > If we have a data segment, with FIN set, that we can't fully transmit, > > > we already set 'partial_send' and won't set TAP_FIN_RCVD as a result. > > > > > > Another case where we want to ignore a FIN segment with data is if we > > > have a gap before it, but in that case we'll eventually set 'keep' and > > > return early. > > > > Ah, right. I'd noticed we set fin = 1 in that case, but forgotten > > about the exit before setting TAP_FIN_RCVD if keep is set. > > > > > > > If so, doesn't this logic need to go after we process the data > > > > > processing, so that seq_from_tap points to the end of the packet's > > > > > data, rather than the beginning? (And the handling of zero-length > > > > > packets would also need revision to match). > > > > > > This made sense to me for a moment but now I'm struggling to understand > > > or remember why. What I want to check here is that a FIN segment > > > without data (I should have specified in the commit message) is > > > acceptable because its sequence is as expected. > > > > Right. This is correct for zero-data FIN segments, but I think as a > > side-effect you've made it ignore certain FIN segments _with_ data. > > It will work in the common case where the data exactly follows on from > > what we already have. But in the case where the segment has some data > > we already have and some new data, the fin = 1 won't trip because seq > > != seq_from_tap. There isn't another place that will catch it > > instead, AFAICT. > > > > I guess it will be fine in the end, because with all the data acked, > > the guest should retransmit the FIN with no data. > > > > > But going back to FIN segments with data: why should we sum the length > > > to seq_from_tap before comparing the sequence? I don't understand what > > > additional check you want to introduce, or what case you want to cover. > > > > I was thinking about the case above, but I didn't explain it very > > well. > > > > > > Following on from that, it seems to me like it would make sense for > > > > FIN segments to also participate in the 'keep' mechanism. It should > > > > work eventually, but I expect it would be smoother in the case that we > > > > get a final burst of packets in a stream out of order. > > > > > > FIN segments with data already go through that dance. > > > > > > Without data, I guess you're right, we might have in the same batch > > > (not that I've ever seen it happening in practice) a FIN segment > > > without data that we process first (and now discard because of the > > > sequence number), and some data before that we process later, so we > > > shouldn't throw away the FIN segment because of that. We should, > > > conceptually, reorder it as well. > > > > > > It probably makes things more complicated for a reason that's not so > > > critical (ignoring a FIN is fine, we'll get another one), and I wanted > > > to have the simplest possible fix here. > > > > > > Let me see if I can make this entirely correct without a substantially > > > bigger change, I haven't really tried. > > > > How about this: > > > > diff --git a/tcp.c b/tcp.c > > index 7da41797..42e576b4 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -1774,10 +1774,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > } > > } > > > > - if (th->fin) > > - fin = 1; > > - > > - if (!len) > > + if (!len && !th->fin) > > continue; > > > > seq_offset = seq_from_tap - seq; > > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > break; > > seq_from_tap += size; > > iov_i += count; > > + if (th->fin) > > + fin = 1; > > > > if (keep == i) > > keep = -1; > > > > > > We'd need to double check that the "accept data segment" path is safe > > with len == 0, of course. > > For sure it's not before d2c33f45f7be ("tcp: Convert > tcp_data_from_tap() to use iov_tail"), because we might add > zero-length segments to the tcp_iov array, and that would make > backporting an otherwise simple and critical fix to slightly older > versions rather complicated. Kinda. It's not that complicated to deal with that case, by wrapping the actual data processing in an `if (len) { ... }` > After that commit, I'm not sure about the behaviour of > iov_tail_clone(). I think it will return 0, but it should be tested > (that assumption, itself, but also that the fix still works). Right, I think it's safe in that case - that's what I was looking at. > > But I think that will treat dataless and > > with-data FINs the same way, and let them use the keep mechanism. > > Given that the only advantage of doing this would be to handle a rare > (I guess) corner case, that is, an out-of-sequence FIN segment with > data, which is not critical anyway because the FIN will be > retransmitted, I would rather keep this (critical) fix as it is. > > I would suggest filing a separate ticket or anyway sending a separate > patch if you want to fix that other case. Fair enough. I'll wait for you to get the basic fix merge, then I might batch this cleanup/fixup with the reorg to clarify bytes_acked handling. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away 2025-10-02 0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio ` (2 preceding siblings ...) 2025-10-02 0:06 ` [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence Stefano Brivio @ 2025-10-02 0:06 ` Stefano Brivio 2025-10-02 3:00 ` David Gibson 3 siblings, 1 reply; 18+ messages in thread From: Stefano Brivio @ 2025-10-02 0:06 UTC (permalink / raw) To: passt-dev ...possibly with an updated window value. As we're discarding the remaining data, we'll need to receive it again. If we don't request a retransmission immediately, we'll see an apparent gap in the sequence, and request a retransmission on the next data batch or segment, but we're just wasting time like that. In packets: 28686 0.000007 88.198.0.164 → 93.235.151.95 16118 TCP 55414 → 47080 [ACK] Seq=80501 Ack=141 Win=65536 Len=16064 [TCP segment of a reassembled PDU] 28687 0.000012 88.198.0.164 → 93.235.151.95 16118 TCP [TCP Window Full] 55414 → 47080 [ACK] Seq=96565 Ack=141 Win=65536 Len=16064 [TCP segment of a reassembled PDU] on this second data segment, we have a short sendmsg(), and 28688 0.000026 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0 consequently acknowledge it, without requesting a retransmission, 28689 0.000006 88.198.0.164 → 93.235.151.95 8866 HTTP HTTP/1.1 200 ok (text/css) so the server proceeds sending a bit more, but 28690 0.000016 93.235.151.95 → 88.198.0.164 54 TCP [TCP Dup ACK 28688#1] 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0 28691 0.000000 93.235.151.95 → 88.198.0.164 54 TCP [TCP Dup ACK 28688#2] 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0 we'll throw that data (from frame #28689) away, and finally request a retransmission as we spotted the gap now. Request a retransmission as soon as we know we'll need it, instead. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> --- tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tcp.c b/tcp.c index 5a7a607..ae5e7a1 100644 --- a/tcp.c +++ b/tcp.c @@ -1876,7 +1876,7 @@ eintr: } out: - if (keep != -1) { + if (keep != -1 || partial_send) { /* We use an 8-bit approximation here: the associated risk is * that we skip a duplicate ACK on 8-bit sequence number * collision. Fast retransmit is a SHOULD in RFC 5681, 3.2. -- 2.43.0 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away 2025-10-02 0:06 ` [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away Stefano Brivio @ 2025-10-02 3:00 ` David Gibson 0 siblings, 0 replies; 18+ messages in thread From: David Gibson @ 2025-10-02 3:00 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 2435 bytes --] On Thu, Oct 02, 2025 at 02:06:46AM +0200, Stefano Brivio wrote: > ...possibly with an updated window value. As we're discarding the > remaining data, we'll need to receive it again. If we don't request > a retransmission immediately, we'll see an apparent gap in the > sequence, and request a retransmission on the next data batch or > segment, but we're just wasting time like that. In packets: > > 28686 0.000007 88.198.0.164 → 93.235.151.95 16118 TCP 55414 → 47080 [ACK] Seq=80501 Ack=141 Win=65536 Len=16064 [TCP segment of a reassembled PDU] > 28687 0.000012 88.198.0.164 → 93.235.151.95 16118 TCP [TCP Window Full] 55414 → 47080 [ACK] Seq=96565 Ack=141 Win=65536 Len=16064 [TCP segment of a reassembled PDU] > > on this second data segment, we have a short sendmsg(), and > > 28688 0.000026 93.235.151.95 → 88.198.0.164 54 TCP 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0 > > consequently acknowledge it, without requesting a retransmission, > > 28689 0.000006 88.198.0.164 → 93.235.151.95 8866 HTTP HTTP/1.1 200 ok (text/css) > > so the server proceeds sending a bit more, but > > 28690 0.000016 93.235.151.95 → 88.198.0.164 54 TCP [TCP Dup ACK 28688#1] 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0 > 28691 0.000000 93.235.151.95 → 88.198.0.164 54 TCP [TCP Dup ACK 28688#2] 47080 → 55414 [ACK] Seq=141 Ack=90721 Win=32128 Len=0 > > we'll throw that data (from frame #28689) away, and finally request > a retransmission as we spotted the gap now. > > Request a retransmission as soon as we know we'll need it, instead. > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> > --- > tcp.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tcp.c b/tcp.c > index 5a7a607..ae5e7a1 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1876,7 +1876,7 @@ eintr: > } > > out: > - if (keep != -1) { > + if (keep != -1 || partial_send) { > /* We use an 8-bit approximation here: the associated risk is > * that we skip a duplicate ACK on 8-bit sequence number > * collision. Fast retransmit is a SHOULD in RFC 5681, 3.2. > -- > 2.43.0 > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-10-06 23:34 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-10-02 0:06 [PATCH 0/4] tcp: Fix bad switch to CLOSE-WAIT state and surrounding issues Stefano Brivio 2025-10-02 0:06 ` [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap Stefano Brivio 2025-10-02 2:41 ` David Gibson 2025-10-02 11:58 ` Stefano Brivio 2025-10-03 3:19 ` David Gibson 2025-10-06 22:32 ` Stefano Brivio 2025-10-06 23:31 ` David Gibson 2025-10-02 0:06 ` [PATCH 2/4] tcp: Completely ignore data segment in CLOSE-WAIT state, log a message Stefano Brivio 2025-10-02 2:44 ` David Gibson 2025-10-02 0:06 ` [PATCH 3/4] tcp: Don't consider FIN flags with mismatching sequence Stefano Brivio 2025-10-02 2:52 ` David Gibson 2025-10-02 3:02 ` David Gibson 2025-10-02 11:51 ` Stefano Brivio 2025-10-03 3:43 ` David Gibson 2025-10-06 22:32 ` Stefano Brivio 2025-10-06 23:34 ` David Gibson 2025-10-02 0:06 ` [PATCH 4/4] tcp: On partial send (incomplete sendmsg()), request a retransmission right away Stefano Brivio 2025-10-02 3:00 ` David Gibson
Code repositories for project(s) associated with this public inbox https://passt.top/passt This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for IMAP folder(s).