From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202512 header.b=l3XketQb; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 6D95A5A0271 for ; Fri, 30 Jan 2026 01:56:14 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202512; t=1769734570; bh=99k89A5I0cX6igNTTFa3O905Oi3pV+bya8RTovlwjnM=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=l3XketQbZEyCcvr6zIdgLnaLOMOLb5RxR1sahit15T5AWgc3EfD8bOysLHqSn8htq Ys3KmuXu7jTWkZCpYOkCKsamHCmsrnnxjhlHzX4z7hjQxwv6iL6HGqapsy/RPYFKqM 6jVUQ6XS74qkP31tZwrQYNgJ5zlDBLETDu9Cc686dBouvN0ktb2gid5LzpK5YrIpGn LfZapxm6i8NOvqaOtP3oHHPF+qmRrbO2RCvr94H3T42RH+8JJTqldfFIAExVF3gaXI ka20gOhse3wcErzHfiVWTl2W260U3V7WwHtWcQ3QirmZeLpgR+xSEkPGYlJDa3MyUT Pt3Eu/b83jq9A== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4f2Hd61j6Bz4wC8; Fri, 30 Jan 2026 11:56:10 +1100 (AEDT) Date: Fri, 30 Jan 2026 11:56:06 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH 2/2] tcp: Eliminate FIN_TIMEOUT Message-ID: References: <20260129055355.1229700-1-david@gibson.dropbear.id.au> <20260129055355.1229700-3-david@gibson.dropbear.id.au> <20260129183330.5632ceff@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="fExY61G8fULuKUcT" Content-Disposition: inline In-Reply-To: <20260129183330.5632ceff@elisabeth> Message-ID-Hash: MOBPW5HEN5VTZKQ56YCJUKWXVMRV3N54 X-Message-ID-Hash: MOBPW5HEN5VTZKQ56YCJUKWXVMRV3N54 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --fExY61G8fULuKUcT Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jan 29, 2026 at 06:33:44PM +0100, Stefano Brivio wrote: > On Thu, 29 Jan 2026 16:53:55 +1100 > David Gibson wrote: >=20 > > From the name, and it's value of 60s, FIN_TIMEOUT appears to be defined > > as an equivalent to the kernel's net.ipv4.tcp_fin_timeout sysctl, which > > controls how long the kernel will keep an orphaned (that is, closed by > > userspace) socket which is in FIN_WAIT_2 state. >=20 > Not really, that wasn't the intention. As you noted, we couldn't even > implement that. Ok, I was misled by the name (so even if we keep something, we should rename it). >=20 > > The theory of operation describes FIN_TIMEOUT thus: > >=20 > > - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a FIN > > segment (write shutdown) was sent via socket (events SOCK_FIN_SENT and > > TAP_FIN_ACKED), but no socket activity is detected from the socket wi= thin > > this time, reset the connection >=20 > This was the intention, and it used to match the implementation. The > equivalent state would be FIN_WAIT_1. No. SOCK_FIN_SENT | TAP_FIN_ACKED implies we have sent FINs to both sides. That also implies we've received FINs from both sides (receiving a FIN on one side is the only thing that will trigger sending a FIN on the other side). That means we can't be in FIN_WAIT_1 w.r.t. either the tap side or sock side connection. On the tap side, we must be in TIME_WAIT (or CLOSED) state. On the socket side we could be in either CLOSING or LAST_ACK state. > I accidentally dropped this case > in commit be5bbb9b0681 ("tcp: Rework timers to use timerfd instead of > periodic bitmap scan"): >=20 > if (conn->events & SOCK_FIN_SENT && sock_act > FIN_TIMEOUT) > goto rst; >=20 > that is, if we detected the last socket activity (including EPOLLHUP) > more than FIN_TIMEOUT seconds ago, and we (presumably) sent a FIN segment > via the socket (calling shutdown()), we would reset the connection. I'm still not clear what this is trying to do. The point here seems to be that we've sent a FIN on the socket side, but it's not acknowledged. Since it _is_ the socket side, AFAICT it's the kernel's problem to follow up at this point. > > This description does not match what net.ipv4.tcp_fin_timeout does. The > > test for SOCK_FIN_SENT does not check if we are in FIN_WAIT_2 or for an > > orphaned socket. In fact SOCK_FIN_SENT means we *cannot* be in FIN_WAI= T_2 > > state (w.r.t. the tap side): we set it when we shutdown(SHUT_WR) the so= cket > > connection. We do that only when we receive a FIN from the tap side, a= nd > > the TCP state diagram does not allow us to be in FIN_WAIT_2 state if we > > already have a FIN from our peer. > >=20 > > The description also doesn't match what the code does: in tcp_timer_ctl= () > > we only set FIN_TIMEOUT on our timer when when ACK_FROM_TAP_DUE is unse= t, > > but we only act on the FIN_TIMEOUT if ACK_FROM_TAP_DUE *is* set. > >=20 > > In fact it's impossible for us to implement something like > > net.ipv4.tcp_fin_timeout, because recognizing an orphaned socket requir= es > > out of band information (the fact the socket has been closed) that an > > endpoint kernel has, but is not visible to us (via either tap or socket > > interface). Therefore, entirely remove the FIN_TIMEOUT related logic. > >=20 > > Signed-off-by: David Gibson > > --- > > tcp.c | 14 -------------- > > 1 file changed, 14 deletions(-) > >=20 > > diff --git a/tcp.c b/tcp.c > > index dbfde2e0..0be871a4 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -190,11 +190,6 @@ > > * - RTO_INIT_AFTER_SYN_RETRIES: if SYN retries happened during handsh= ake and > > * RTO is less than this, re-initialise RTO to this for data retrans= missions > > * > > - * - FIN_TIMEOUT: if a FIN segment was acknowledged by tap/guest and a= FIN > > - * segment (write shutdown) was sent via socket (events SOCK_FIN_SEN= T and > > - * TAP_FIN_ACKED), but no socket activity is detected from the socke= t within > > - * this time, reset the connection >=20 > I think the patch is correct, strictly speaking, but we should > implement what's described here again. We won't otherwise have any > mechanism to reset connections() if we call shutdown(), and no further > events happen are reported by the socket. Ok, so the case we're looking for is a kernel bug where it doesn't follow up on its unacknowledged FIN to the peer. It's not clear to me that we need to bother working around that bug at all. Even if we do, we may not need a timer - we can (and arguably should, not sure if we do) delay our ACK of the guest's FIN until our FIN to the peer is acknowledged (EPOLLHUP). That way the guest will resend FINs which will remind us to check up on the socket side. > Should I apply the series anyway, or wait? Maybe apply 1/2? I think it's a correct fix independent of this one. > > - * > > * - ACT_TIMEOUT, in the presence of any event: if no activity is dete= cted on > > * either side, the connection is reset > > * > > @@ -341,7 +336,6 @@ enum { > > =20 > > #define RTO_INIT 1 /* s, RFC 6298 */ > > #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */ > > -#define FIN_TIMEOUT 60 > > #define ACT_TIMEOUT 7200 > > =20 > > #define LOW_RTT_TABLE_SIZE 8 > > @@ -594,8 +588,6 @@ static void tcp_timer_ctl(const struct ctx *c, stru= ct tcp_tap_conn *conn) > > timeout =3D MAX(timeout, RTO_INIT_AFTER_SYN_RETRIES); > > timeout <<=3D MAX(exp, 0); > > it.it_value.tv_sec =3D MIN(timeout, c->tcp.rto_max); > > - } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > - it.it_value.tv_sec =3D FIN_TIMEOUT; > > } else { > > it.it_value.tv_sec =3D ACT_TIMEOUT; > > } > > @@ -715,9 +707,6 @@ void conn_event_do(const struct ctx *c, struct tcp_= tap_conn *conn, > > flow_hash_remove(c, TAP_SIDX(conn)); > > tcp_epoll_ctl(conn); > > } > > - > > - if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) > > - tcp_timer_ctl(c, conn); > > } > > =20 > > /** > > @@ -2590,9 +2579,6 @@ void tcp_timer_handler(const struct ctx *c, union= epoll_ref ref) > > conn_flag(c, conn, SYN_RETRIED); > > tcp_timer_ctl(c, conn); > > } > > - } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > - flow_dbg(conn, "FIN timeout"); > > - tcp_rst(c, conn); > > } else if (conn->retries =3D=3D TCP_MAX_RETRIES) { > > flow_dbg(conn, "retransmissions count exceeded"); > > tcp_rst(c, conn); >=20 > --=20 > Stefano >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --fExY61G8fULuKUcT Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAml8AZkACgkQzQJF27ox 2Gf/SQ/9F4KKkwZoAWosn6SoL7KEAq5i5MBSkTpYymT4fzDKkId5+qMWPzSdVTK8 wiGz2omyh0L84shgZQUcZ/3miHjW1P6p1JaVD+P9jQGNst4rFM6XWn14uc/RzPsK xR/7C6n+N8kwulu8rqPFUtfKWjgcnV8ni1k0kYN053ORPa/7nLUzQJyiNySa6cLz ZJZFbX4yzYUPNy+6bwiMcZXbBLixYLhmvf8O2ZMyg6Nl7GNIVllc4bWhJyDMqbUf oT8P2ozUO6cDCImmzZeGT9xftF40B8+KFqIzdzycKEu854Az7e/titiWX7+icd5j MsfuqD2w9dOL+QnTWazvHnH/oPGGpfQHlo9QscNH0p0v51mS76KNBvXBrEaPEkFW nifpjx4vGL1IkeEYu8+4A0iTa2AEGX7qugLAprLONpUIZUW4bw+79mzdYzQUxEIu 3VkcPkv/7K0ZnFnnb8/7l203HrUGtS8ka8oT/99001VH+fo6GQrUMvJ7hdqbVtl0 6+x3Wnbgmd6rNjN7ZQds+Jnn0/5d+n3xDAIG7Jk+i+zXLtC47Z/Uijj7yIa+tDYM hMY/y6G2xugMoKKOyMYKUVx6e39JRHyhBnoEB4hdzAEWtSAw5D7HJXXYt3zFght9 E7v9jeOhVxSmBPtOxLiW3e5E43ZHrtsfI3hf1zpxe6vLFb87LKo= =rjrE -----END PGP SIGNATURE----- --fExY61G8fULuKUcT--