* [PATCH 0/2] Fix errors in FIN timeout logic
@ 2026-01-29 5:53 David Gibson
2026-01-29 5:53 ` [PATCH 1/2] tcp: Retransmit FINs like data segments David Gibson
2026-01-29 5:53 ` [PATCH 2/2] tcp: Eliminate FIN_TIMEOUT David Gibson
0 siblings, 2 replies; 5+ messages in thread
From: David Gibson @ 2026-01-29 5:53 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
While investigating bug 179, I found a number of things that confused
me about the TCP timer handling. One of them, I think I figured out
what's going on and what should be done about it. So, here are the
changes. This is mostly about FIN handling and only tangentially
about the timer, but it does at least slightly simplify the timer
handling while I figure out the rest of it.
David Gibson (2):
tcp: Retransmit FINs like data segments
tcp: Eliminate FIN_TIMEOUT
tcp.c | 18 ------------------
tcp_buf.c | 1 +
tcp_vu.c | 1 +
3 files changed, 2 insertions(+), 18 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/2] tcp: Retransmit FINs like data segments
2026-01-29 5:53 [PATCH 0/2] Fix errors in FIN timeout logic David Gibson
@ 2026-01-29 5:53 ` David Gibson
2026-01-29 5:53 ` [PATCH 2/2] tcp: Eliminate FIN_TIMEOUT David Gibson
1 sibling, 0 replies; 5+ messages in thread
From: David Gibson @ 2026-01-29 5:53 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
RFC 9293 doesn't distinguish between regular data segments and FIN segments
for the purposes of retransmissions. Our existing retransmission logic
will also work for FIN segments, except for one detail: we don't currently
set the ACK_FROM_TAP_DUE flag when we send a FIN. Add the flag, so that
we'll properly retransmit FIN segments like data segments.
Remove the section from the theory of operation comment that describes a
different way of handling FIN timeouts which (a) isn't correct behaviour
and (b) doesn't appear to be implemented.
I've tested this by adding logic to suppress sending the FIN if retries <
some non-zero value. We correctly resend the FIN and close normally after
the expected timeouts.
Link: https://bugs.passt.top/show_bug.cgi?id=195
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 4 ----
tcp_buf.c | 1 +
tcp_vu.c | 1 +
3 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/tcp.c b/tcp.c
index 17e5b006..dbfde2e0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -190,10 +190,6 @@
* - RTO_INIT_AFTER_SYN_RETRIES: if SYN retries happened during handshake and
* RTO is less than this, re-initialise RTO to this for data retransmissions
*
- * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
- * with TAP_FIN_SENT event), and no ACK is received within this time, reset
- * the connection
- *
* - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN
* segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and
* TAP_FIN_ACKED), but no socket activity is detected from the socket within
diff --git a/tcp_buf.c b/tcp_buf.c
index 5d419d36..d2925410 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -407,6 +407,7 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
}
conn_event(c, conn, TAP_FIN_SENT);
+ conn_flag(c, conn, ACK_FROM_TAP_DUE);
}
return 0;
diff --git a/tcp_vu.c b/tcp_vu.c
index db9db78a..b9e9b55e 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -425,6 +425,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
}
conn_event(c, conn, TAP_FIN_SENT);
+ conn_flag(c, conn, ACK_FROM_TAP_DUE);
}
return 0;
--
2.52.0
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 2/2] tcp: Eliminate FIN_TIMEOUT
2026-01-29 5:53 [PATCH 0/2] Fix errors in FIN timeout logic David Gibson
2026-01-29 5:53 ` [PATCH 1/2] tcp: Retransmit FINs like data segments David Gibson
@ 2026-01-29 5:53 ` David Gibson
2026-01-29 17:33 ` Stefano Brivio
1 sibling, 1 reply; 5+ messages in thread
From: David Gibson @ 2026-01-29 5:53 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
From the name, and it's value of 60s, FIN_TIMEOUT appears to be defined
as an equivalent to the kernel's net.ipv4.tcp_fin_timeout sysctl, which
controls how long the kernel will keep an orphaned (that is, closed by
userspace) socket which is in FIN_WAIT_2 state.
The theory of operation describes FIN_TIMEOUT thus:
- FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN
segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and
TAP_FIN_ACKED), but no socket activity is detected from the socket within
this time, reset the connection
This description does not match what net.ipv4.tcp_fin_timeout does. The
test for SOCK_FIN_SENT does not check if we are in FIN_WAIT_2 or for an
orphaned socket. In fact SOCK_FIN_SENT means we *cannot* be in FIN_WAIT_2
state (w.r.t. the tap side): we set it when we shutdown(SHUT_WR) the socket
connection. We do that only when we receive a FIN from the tap side, and
the TCP state diagram does not allow us to be in FIN_WAIT_2 state if we
already have a FIN from our peer.
The description also doesn't match what the code does: in tcp_timer_ctl()
we only set FIN_TIMEOUT on our timer when when ACK_FROM_TAP_DUE is unset,
but we only act on the FIN_TIMEOUT if ACK_FROM_TAP_DUE *is* set.
In fact it's impossible for us to implement something like
net.ipv4.tcp_fin_timeout, because recognizing an orphaned socket requires
out of band information (the fact the socket has been closed) that an
endpoint kernel has, but is not visible to us (via either tap or socket
interface). Therefore, entirely remove the FIN_TIMEOUT related logic.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 14 --------------
1 file changed, 14 deletions(-)
diff --git a/tcp.c b/tcp.c
index dbfde2e0..0be871a4 100644
--- a/tcp.c
+++ b/tcp.c
@@ -190,11 +190,6 @@
* - RTO_INIT_AFTER_SYN_RETRIES: if SYN retries happened during handshake and
* RTO is less than this, re-initialise RTO to this for data retransmissions
*
- * - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN
- * segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and
- * TAP_FIN_ACKED), but no socket activity is detected from the socket within
- * this time, reset the connection
- *
* - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
* either side, the connection is reset
*
@@ -341,7 +336,6 @@ enum {
#define RTO_INIT 1 /* s, RFC 6298 */
#define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
-#define FIN_TIMEOUT 60
#define ACT_TIMEOUT 7200
#define LOW_RTT_TABLE_SIZE 8
@@ -594,8 +588,6 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
timeout = MAX(timeout, RTO_INIT_AFTER_SYN_RETRIES);
timeout <<= MAX(exp, 0);
it.it_value.tv_sec = MIN(timeout, c->tcp.rto_max);
- } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
- it.it_value.tv_sec = FIN_TIMEOUT;
} else {
it.it_value.tv_sec = ACT_TIMEOUT;
}
@@ -715,9 +707,6 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
flow_hash_remove(c, TAP_SIDX(conn));
tcp_epoll_ctl(conn);
}
-
- if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
- tcp_timer_ctl(c, conn);
}
/**
@@ -2590,9 +2579,6 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
conn_flag(c, conn, SYN_RETRIED);
tcp_timer_ctl(c, conn);
}
- } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
- flow_dbg(conn, "FIN timeout");
- tcp_rst(c, conn);
} else if (conn->retries == TCP_MAX_RETRIES) {
flow_dbg(conn, "retransmissions count exceeded");
tcp_rst(c, conn);
--
2.52.0
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 2/2] tcp: Eliminate FIN_TIMEOUT
2026-01-29 5:53 ` [PATCH 2/2] tcp: Eliminate FIN_TIMEOUT David Gibson
@ 2026-01-29 17:33 ` Stefano Brivio
2026-01-30 0:56 ` David Gibson
0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2026-01-29 17:33 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Thu, 29 Jan 2026 16:53:55 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> From the name, and it's value of 60s, FIN_TIMEOUT appears to be defined
> as an equivalent to the kernel's net.ipv4.tcp_fin_timeout sysctl, which
> controls how long the kernel will keep an orphaned (that is, closed by
> userspace) socket which is in FIN_WAIT_2 state.
Not really, that wasn't the intention. As you noted, we couldn't even
implement that.
> The theory of operation describes FIN_TIMEOUT thus:
>
> - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN
> segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and
> TAP_FIN_ACKED), but no socket activity is detected from the socket within
> this time, reset the connection
This was the intention, and it used to match the implementation. The
equivalent state would be FIN_WAIT_1. I accidentally dropped this case
in commit be5bbb9b0681 ("tcp: Rework timers to use timerfd instead of
periodic bitmap scan"):
if (conn->events & SOCK_FIN_SENT && sock_act > FIN_TIMEOUT)
goto rst;
that is, if we detected the last socket activity (including EPOLLHUP)
more than FIN_TIMEOUT seconds ago, and we (presumably) sent a FIN segment
via the socket (calling shutdown()), we would reset the connection.
> This description does not match what net.ipv4.tcp_fin_timeout does. The
> test for SOCK_FIN_SENT does not check if we are in FIN_WAIT_2 or for an
> orphaned socket. In fact SOCK_FIN_SENT means we *cannot* be in FIN_WAIT_2
> state (w.r.t. the tap side): we set it when we shutdown(SHUT_WR) the socket
> connection. We do that only when we receive a FIN from the tap side, and
> the TCP state diagram does not allow us to be in FIN_WAIT_2 state if we
> already have a FIN from our peer.
>
> The description also doesn't match what the code does: in tcp_timer_ctl()
> we only set FIN_TIMEOUT on our timer when when ACK_FROM_TAP_DUE is unset,
> but we only act on the FIN_TIMEOUT if ACK_FROM_TAP_DUE *is* set.
>
> In fact it's impossible for us to implement something like
> net.ipv4.tcp_fin_timeout, because recognizing an orphaned socket requires
> out of band information (the fact the socket has been closed) that an
> endpoint kernel has, but is not visible to us (via either tap or socket
> interface). Therefore, entirely remove the FIN_TIMEOUT related logic.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 14 --------------
> 1 file changed, 14 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index dbfde2e0..0be871a4 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -190,11 +190,6 @@
> * - RTO_INIT_AFTER_SYN_RETRIES: if SYN retries happened during handshake and
> * RTO is less than this, re-initialise RTO to this for data retransmissions
> *
> - * - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN
> - * segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and
> - * TAP_FIN_ACKED), but no socket activity is detected from the socket within
> - * this time, reset the connection
I think the patch is correct, strictly speaking, but we should
implement what's described here again. We won't otherwise have any
mechanism to reset connections() if we call shutdown(), and no further
events happen are reported by the socket.
Should I apply the series anyway, or wait?
> - *
> * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> * either side, the connection is reset
> *
> @@ -341,7 +336,6 @@ enum {
>
> #define RTO_INIT 1 /* s, RFC 6298 */
> #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
> -#define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
>
> #define LOW_RTT_TABLE_SIZE 8
> @@ -594,8 +588,6 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> timeout = MAX(timeout, RTO_INIT_AFTER_SYN_RETRIES);
> timeout <<= MAX(exp, 0);
> it.it_value.tv_sec = MIN(timeout, c->tcp.rto_max);
> - } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> - it.it_value.tv_sec = FIN_TIMEOUT;
> } else {
> it.it_value.tv_sec = ACT_TIMEOUT;
> }
> @@ -715,9 +707,6 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
> flow_hash_remove(c, TAP_SIDX(conn));
> tcp_epoll_ctl(conn);
> }
> -
> - if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
> - tcp_timer_ctl(c, conn);
> }
>
> /**
> @@ -2590,9 +2579,6 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> conn_flag(c, conn, SYN_RETRIED);
> tcp_timer_ctl(c, conn);
> }
> - } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> - flow_dbg(conn, "FIN timeout");
> - tcp_rst(c, conn);
> } else if (conn->retries == TCP_MAX_RETRIES) {
> flow_dbg(conn, "retransmissions count exceeded");
> tcp_rst(c, conn);
--
Stefano
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 2/2] tcp: Eliminate FIN_TIMEOUT
2026-01-29 17:33 ` Stefano Brivio
@ 2026-01-30 0:56 ` David Gibson
0 siblings, 0 replies; 5+ messages in thread
From: David Gibson @ 2026-01-30 0:56 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 6770 bytes --]
On Thu, Jan 29, 2026 at 06:33:44PM +0100, Stefano Brivio wrote:
> On Thu, 29 Jan 2026 16:53:55 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > From the name, and it's value of 60s, FIN_TIMEOUT appears to be defined
> > as an equivalent to the kernel's net.ipv4.tcp_fin_timeout sysctl, which
> > controls how long the kernel will keep an orphaned (that is, closed by
> > userspace) socket which is in FIN_WAIT_2 state.
>
> Not really, that wasn't the intention. As you noted, we couldn't even
> implement that.
Ok, I was misled by the name (so even if we keep something, we should
rename it).
>
> > The theory of operation describes FIN_TIMEOUT thus:
> >
> > - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN
> > segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and
> > TAP_FIN_ACKED), but no socket activity is detected from the socket within
> > this time, reset the connection
>
> This was the intention, and it used to match the implementation. The
> equivalent state would be FIN_WAIT_1.
No. SOCK_FIN_SENT | TAP_FIN_ACKED implies we have sent FINs to both
sides. That also implies we've received FINs from both sides
(receiving a FIN on one side is the only thing that will trigger
sending a FIN on the other side). That means we can't be in
FIN_WAIT_1 w.r.t. either the tap side or sock side connection.
On the tap side, we must be in TIME_WAIT (or CLOSED) state. On the
socket side we could be in either CLOSING or LAST_ACK state.
> I accidentally dropped this case
> in commit be5bbb9b0681 ("tcp: Rework timers to use timerfd instead of
> periodic bitmap scan"):
>
> if (conn->events & SOCK_FIN_SENT && sock_act > FIN_TIMEOUT)
> goto rst;
>
> that is, if we detected the last socket activity (including EPOLLHUP)
> more than FIN_TIMEOUT seconds ago, and we (presumably) sent a FIN segment
> via the socket (calling shutdown()), we would reset the connection.
I'm still not clear what this is trying to do. The point here seems
to be that we've sent a FIN on the socket side, but it's not
acknowledged. Since it _is_ the socket side, AFAICT it's the kernel's
problem to follow up at this point.
> > This description does not match what net.ipv4.tcp_fin_timeout does. The
> > test for SOCK_FIN_SENT does not check if we are in FIN_WAIT_2 or for an
> > orphaned socket. In fact SOCK_FIN_SENT means we *cannot* be in FIN_WAIT_2
> > state (w.r.t. the tap side): we set it when we shutdown(SHUT_WR) the socket
> > connection. We do that only when we receive a FIN from the tap side, and
> > the TCP state diagram does not allow us to be in FIN_WAIT_2 state if we
> > already have a FIN from our peer.
> >
> > The description also doesn't match what the code does: in tcp_timer_ctl()
> > we only set FIN_TIMEOUT on our timer when when ACK_FROM_TAP_DUE is unset,
> > but we only act on the FIN_TIMEOUT if ACK_FROM_TAP_DUE *is* set.
> >
> > In fact it's impossible for us to implement something like
> > net.ipv4.tcp_fin_timeout, because recognizing an orphaned socket requires
> > out of band information (the fact the socket has been closed) that an
> > endpoint kernel has, but is not visible to us (via either tap or socket
> > interface). Therefore, entirely remove the FIN_TIMEOUT related logic.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp.c | 14 --------------
> > 1 file changed, 14 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index dbfde2e0..0be871a4 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -190,11 +190,6 @@
> > * - RTO_INIT_AFTER_SYN_RETRIES: if SYN retries happened during handshake and
> > * RTO is less than this, re-initialise RTO to this for data retransmissions
> > *
> > - * - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN
> > - * segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and
> > - * TAP_FIN_ACKED), but no socket activity is detected from the socket within
> > - * this time, reset the connection
>
> I think the patch is correct, strictly speaking, but we should
> implement what's described here again. We won't otherwise have any
> mechanism to reset connections() if we call shutdown(), and no further
> events happen are reported by the socket.
Ok, so the case we're looking for is a kernel bug where it doesn't
follow up on its unacknowledged FIN to the peer. It's not clear to me
that we need to bother working around that bug at all. Even if we do,
we may not need a timer - we can (and arguably should, not sure if we
do) delay our ACK of the guest's FIN until our FIN to the peer is
acknowledged (EPOLLHUP). That way the guest will resend FINs which
will remind us to check up on the socket side.
> Should I apply the series anyway, or wait?
Maybe apply 1/2? I think it's a correct fix independent of this one.
> > - *
> > * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> > * either side, the connection is reset
> > *
> > @@ -341,7 +336,6 @@ enum {
> >
> > #define RTO_INIT 1 /* s, RFC 6298 */
> > #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
> > -#define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> >
> > #define LOW_RTT_TABLE_SIZE 8
> > @@ -594,8 +588,6 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > timeout = MAX(timeout, RTO_INIT_AFTER_SYN_RETRIES);
> > timeout <<= MAX(exp, 0);
> > it.it_value.tv_sec = MIN(timeout, c->tcp.rto_max);
> > - } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > - it.it_value.tv_sec = FIN_TIMEOUT;
> > } else {
> > it.it_value.tv_sec = ACT_TIMEOUT;
> > }
> > @@ -715,9 +707,6 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
> > flow_hash_remove(c, TAP_SIDX(conn));
> > tcp_epoll_ctl(conn);
> > }
> > -
> > - if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED))
> > - tcp_timer_ctl(c, conn);
> > }
> >
> > /**
> > @@ -2590,9 +2579,6 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > conn_flag(c, conn, SYN_RETRIED);
> > tcp_timer_ctl(c, conn);
> > }
> > - } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > - flow_dbg(conn, "FIN timeout");
> > - tcp_rst(c, conn);
> > } else if (conn->retries == TCP_MAX_RETRIES) {
> > flow_dbg(conn, "retransmissions count exceeded");
> > tcp_rst(c, conn);
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-01-30 0:56 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-29 5:53 [PATCH 0/2] Fix errors in FIN timeout logic David Gibson
2026-01-29 5:53 ` [PATCH 1/2] tcp: Retransmit FINs like data segments David Gibson
2026-01-29 5:53 ` [PATCH 2/2] tcp: Eliminate FIN_TIMEOUT David Gibson
2026-01-29 17:33 ` Stefano Brivio
2026-01-30 0:56 ` David Gibson
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).