From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202510 header.b=HCCg8tbz; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 518D05A026F for ; Wed, 05 Nov 2025 05:24:54 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202510; t=1762316692; bh=NaFfcalyt+fa6UiazqmkVY0psJ+IPfdff4Yub7tVyQI=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=HCCg8tbz6fXn2a7cJPJCYDRt18XWPYYIGMh6XaWlFjZVRaDS3LxNVoUCZVU4dyvan hNXOmvgn0SWdYsiQS+pZ6PzmwLChuof15LnrtQMnMt5ACNeFRRAuor1ZuNUaD7HHLH 3ifkn6rj292HkU12kxhQTJeYX+WmpRqNHjxHw4n6cfemNRS1mTBLXj4ydvueCbvUst 8Fjta/po1KCV1ikYsCIcHN05SviqHeMZuhsRT74ZVSL1fX9edFhUnRFbesGJUiVNwG OR9sMXJ7TuKl3dxMYQrERywCDw95t8yriBkEJaTDqr+G9TYWYnc7rBuffBet32WGIh 8Iv/2OyhOa5bw== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4d1XKc0yP9z4w2R; Wed, 05 Nov 2025 15:24:52 +1100 (AEDT) Date: Wed, 5 Nov 2025 15:22:27 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH v7 5/5] tcp: Clamp the retry timeout Message-ID: References: <20251031054242.7334-1-yuhuang@redhat.com> <20251031054242.7334-6-yuhuang@redhat.com> <20251103113857.47f7d008@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="ezLhYCsDLeyhitSv" Content-Disposition: inline In-Reply-To: <20251103113857.47f7d008@elisabeth> Message-ID-Hash: SUESD3QW6GATHU3J7KTVWGLPZX7ZJREY X-Message-ID-Hash: SUESD3QW6GATHU3J7KTVWGLPZX7ZJREY X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Yumei Huang , passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --ezLhYCsDLeyhitSv Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Nov 03, 2025 at 11:38:57AM +0100, Stefano Brivio wrote: > On Mon, 3 Nov 2025 12:37:52 +1100 > David Gibson wrote: >=20 > > On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote: > > > Clamp the TCP retry timeout as Linux kernel does. If RTO is less > > > than 3 seconds, re-initialize it to 3 seconds for data retransmissions > > > according to RFC 6298. > > >=20 > > > Suggested-by: Stefano Brivio > > > Signed-off-by: Yumei Huang > > > --- > > > tcp.c | 25 ++++++++++++++++++++----- > > > tcp.h | 2 ++ > > > 2 files changed, 22 insertions(+), 5 deletions(-) > > >=20 > > > diff --git a/tcp.c b/tcp.c > > > index 96ee56a..84a6700 100644 > > > --- a/tcp.c > > > +++ b/tcp.c > > > @@ -187,6 +187,9 @@ > > > * for established connections, or (tcp_syn_retries + > > > * tcp_syn_linear_timeouts) times during the handshake, reset the = connection > > > * > > > + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO t= o this for > > > + * data retransmissions. > > > + * > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_F= ROM_TAP_DUE > > > * with TAP_FIN_SENT event), and no ACK is received within this ti= me, reset > > > * the connection > > > @@ -340,6 +343,7 @@ enum { > > > =20 > > > #define ACK_INTERVAL 10 /* ms */ > > > #define RTO_INIT 1 /* s, RFC 6298 */ > > > +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ > > > #define FIN_TIMEOUT 60 > > > #define ACT_TIMEOUT 7200 > > > =20 > > > @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_= QUEUE_MAX]; > > > =20 > > > #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" > > > #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_t= imeouts" > > > +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms" > > > =20 > > > #define TCP_SYN_RETRIES_DEFAULT 6 > > > #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 > > > +#define TCP_RTO_MAX_MS_DEFAULT 120000 > > > =20 > > > /* "Extended" data (not stored in the flow table) for TCP flow migra= tion */ > > > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; > > > @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, = struct tcp_tap_conn *conn) > > > if (conn->flags & ACK_TO_TAP_DUE) { > > > it.it_value.tv_nsec =3D (long)ACK_INTERVAL * 1000 * 1000; > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > - int exp =3D conn->retries; > > > + int exp =3D conn->retries, timeout =3D RTO_INIT; > > > if (!(conn->events & ESTABLISHED)) > > > exp -=3D c->tcp.syn_linear_timeouts; > > > - it.it_value.tv_sec =3D RTO_INIT << MAX(exp, 0); > > > + else > > > + timeout =3D MAX(timeout, RTO_INIT_ACK); =20 > >=20 > > Possibly I missed something, since I only skimmed your discussion of > > this behaviour with Stefano. But I'm not convinced this is a correct > > interpretation of the RFC. (5.7) says "If the timer expires awaiting > > the ACK of a SYN segment ...". That is, I think it's only suggesting > > increasing the RTO to 3 at the data stage *if* we had at least one > > retry during the handshake. >=20 > Oops, true, my bad. >=20 > I guess I suggested the interpretation as of v7 because I just skimmed > Appendix A of RFC 6298 whose main function is to justify the reasons > behind lowering the initial timeout to one second, and I thought these > reason simply don't apply to the established phase, so we use three > seconds there. >=20 > But that's clearly not the case, hence the "When this happens" clause > in the middle of Appendix A. >=20 > > That is, unfortunately, much fiddlier to > > implement, since we need to remember what happened during the > > handshake to apply it here. >=20 > Hmm, >=20 > --- > diff --git a/tcp.c b/tcp.c > index c1eb5de..90c3ca1 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -397,7 +397,7 @@ static const char *tcp_state_str[] __attribute((__unu= sed__)) =3D { > =20 > static const char *tcp_flag_str[] __attribute((__unused__)) =3D { > "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", > - "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", > + "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", "SYN_RETRIED", > }; > =20 > /* Listening sockets, used for automatic port forwarding in pasta mode o= nly */ > @@ -597,7 +597,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn) > int exp =3D conn->retries, timeout =3D RTO_INIT; > if (!(conn->events & ESTABLISHED)) > exp -=3D c->tcp.syn_linear_timeouts; > - else > + else if (conn->flags & SYN_RETRIED) > timeout =3D MAX(timeout, RTO_INIT_ACK); > timeout <<=3D MAX(exp, 0); > it.it_value.tv_sec =3D MIN(timeout, c->tcp.tcp_rto_max); > @@ -2446,6 +2446,7 @@ void tcp_timer_handler(const struct ctx *c, union e= poll_ref ref) > flow_trace(conn, "SYN timeout, retry"); > tcp_send_flag(c, conn, SYN); > conn->retries++; > + conn_flag(c, conn, SYN_RETRIED); > tcp_timer_ctl(c, conn); > } > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > diff --git a/tcp_conn.h b/tcp_conn.h > index c006d56..87f4a2d 100644 > --- a/tcp_conn.h > +++ b/tcp_conn.h > @@ -77,6 +77,7 @@ struct tcp_tap_conn { > #define ACK_TO_TAP_DUE BIT(3) > #define ACK_FROM_TAP_DUE BIT(4) > #define ACK_FROM_TAP_BLOCKS BIT(5) > +#define SYN_RETRIED BIT(6) > =20 > #define SNDBUF_BITS 24 > unsigned int sndbuf :SNDBUF_BITS; > --- >=20 > ? Ok, not as fiddly as I feared. \o/ > > Additionally, if I'm reading the RFC correctly, it's treating this as > > a one-time adjustment of the RTO, which won't necessarily remain the > > case for the entire data phase. Here this minimum will apply for the > > entire data phase. >=20 > But it's the initial RTO (see Appendix A, which states it clearly), and > any exponentiation is based on the initial value, so this should fit > the requirement. >=20 > > Even though it's a "MUST" in the RFC, I kind of think we could just > > skip this for two reasons: > >=20 > > 1) We already don't bother with RTT measurements, which the RFC > > assumes the implementation is doing to adjust the RTO. >=20 > Kind of: >=20 > (2.1) Until a round-trip time (RTT) measurement has been made for a > segment sent between the sender and receiver, the sender SHOULD > set RTO <- 1 second >=20 > and given that this condition (no round-trip time measurement done yet) > is explicitly considered, I guess we can reasonably expect TCP stacks we > might be talking to to be fully compatible with what we're doing, as > long as we stick to the RFC. Good points, you convinced me. > > 2) We expect to be talking to a guest. The chance of a high RTT is > > vanishingly small compared to a path to potentially any host on > > the 2011 internet. >=20 > ...until we move the guest / container and we implement a TCP transport > (somewhat overdue) in place of the existing tap / UNIX domain / > vhost-user connection. Right. Or even right now, if the guest then NATs the connection onto a slow VPN link. > The chance is still small and the consequences of ignoring this part of > the RFC are, I'm fairly sure, negligible, but if it's that easy to > implement, should we really depart from it? >=20 > --=20 > Stefano >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --ezLhYCsDLeyhitSv Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmkK0QIACgkQzQJF27ox 2Gdgew/8CFO6+Ri3rUWdM/XMuYCnByhTqFHScV/ZdOzSmVdi5YfWxcY7Ahp97740 RI1N+NgMNQgvvhrKq0PpsKaeHNr65LJlAm8SDmJkXf2VIRxmyFjfr0Zv4iPRNG0M EJDj2+yPnHkSPVTlmwtQ4BqbmW9DDZ+sZndNsKVRuHgfTI4fEKdDJWrlERy2bRCS LkzX57rp2XwvRaYUkqAj/FIV7hpf71d33ZjvfucqqLKS3C3mn5G9wSP1Tz9z+ezL RIaLVKKiYg6ptQxJ18kDPgZ2BIHgqyiyYn5LQFBLxXEgLXKhaza25ZQfL2JsACXG nPQ4qHdfrRrf0N74+H5rC7CySb08LgEZz+TCTlGsw2OSXOzNOIviZSY0fE2P0HAs cJddSJ4WRScPmn9fKZ6qYO/FN4CDzgbFpnomApDJkVqDf9hY5U2bkAA/zaDtJwVb K67WnegPtRCCCwkxNT7p8zHktNWjpoBkcJU1mAXmh3RDRB6gBOVTHayBYoqNwouq EiQf+duu6mPODAtor1QiYlMEtXXhVOLhIIuCKlxNfmuHKbvT2+KtvttcYoDBiFLV CHrTQaVkSK1h1yYMwp2OwK5xxVpaubCuMx3S73iZK0DYMFVZqTk+ZMN8WAkWL5wk cq+MGj8+XkamXuQJ3VS/2cQky5QW1m29KElsfHOSL0TN8tNEKnA= =HBhP -----END PGP SIGNATURE----- --ezLhYCsDLeyhitSv--