On Mon, Nov 03, 2025 at 11:38:57AM +0100, Stefano Brivio wrote:
> On Mon, 3 Nov 2025 12:37:52 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
> > > Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> > > than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> > > according to RFC 6298.
> > > 
> > > Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > ---
> > >  tcp.c | 25 ++++++++++++++++++++-----
> > >  tcp.h |  2 ++
> > >  2 files changed, 22 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index 96ee56a..84a6700 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -187,6 +187,9 @@
> > >   *   for established connections, or (tcp_syn_retries +
> > >   *   tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > >   *
> > > + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
> > > + *   data retransmissions.
> > > + *
> > >   * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > >   *   with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > >   *   the connection
> > > @@ -340,6 +343,7 @@ enum {
> > >  
> > >  #define ACK_INTERVAL			10		/* ms */
> > >  #define RTO_INIT			1		/* s, RFC 6298 */
> > > +#define RTO_INIT_ACK			3		/* s, RFC 6298 */
> > >  #define FIN_TIMEOUT			60
> > >  #define ACT_TIMEOUT			7200
> > >  
> > > @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue		[TCP_MIGRATE_RCV_QUEUE_MAX];
> > >  
> > >  #define TCP_SYN_RETRIES		"/proc/sys/net/ipv4/tcp_syn_retries"
> > >  #define TCP_SYN_LINEAR_TIMEOUTS	"/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > > +#define TCP_RTO_MAX_MS			"/proc/sys/net/ipv4/tcp_rto_max_ms"
> > >  
> > >  #define TCP_SYN_RETRIES_DEFAULT		6
> > >  #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT	4
> > > +#define TCP_RTO_MAX_MS_DEFAULT			120000
> > >  
> > >  /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > >  static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > > @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > >  	if (conn->flags & ACK_TO_TAP_DUE) {
> > >  		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > >  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > -		int exp = conn->retries;
> > > +		int exp = conn->retries, timeout = RTO_INIT;
> > >  		if (!(conn->events & ESTABLISHED))
> > >  			exp -= c->tcp.syn_linear_timeouts;
> > > -		it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> > > +		else
> > > +			timeout = MAX(timeout, RTO_INIT_ACK);  
> > 
> > Possibly I missed something, since I only skimmed your discussion of
> > this behaviour with Stefano.  But I'm not convinced this is a correct
> > interpretation of the RFC.  (5.7) says "If the timer expires awaiting
> > the ACK of a SYN segment ...".  That is, I think it's only suggesting
> > increasing the RTO to 3 at the data stage *if* we had at least one
> > retry during the handshake.
> 
> Oops, true, my bad.
> 
> I guess I suggested the interpretation as of v7 because I just skimmed
> Appendix A of RFC 6298 whose main function is to justify the reasons
> behind lowering the initial timeout to one second, and I thought these
> reason simply don't apply to the established phase, so we use three
> seconds there.
> 
> But that's clearly not the case, hence the "When this happens" clause
> in the middle of Appendix A.
> 
> > That is, unfortunately, much fiddlier to
> > implement, since we need to remember what happened during the
> > handshake to apply it here.
> 
> Hmm,
> 
> ---
> diff --git a/tcp.c b/tcp.c
> index c1eb5de..90c3ca1 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -397,7 +397,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
>  
>  static const char *tcp_flag_str[] __attribute((__unused__)) = {
>  	"STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
> -	"ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS",
> +	"ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", "SYN_RETRIED",
>  };
>  
>  /* Listening sockets, used for automatic port forwarding in pasta mode only */
> @@ -597,7 +597,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn)
>  		int exp = conn->retries, timeout = RTO_INIT;
>  		if (!(conn->events & ESTABLISHED))
>  			exp -= c->tcp.syn_linear_timeouts;
> -		else
> +		else if (conn->flags & SYN_RETRIED)
>  			timeout = MAX(timeout, RTO_INIT_ACK);
>  		timeout <<= MAX(exp, 0);
>  		it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
> @@ -2446,6 +2446,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
>  				flow_trace(conn, "SYN timeout, retry");
>  				tcp_send_flag(c, conn, SYN);
>  				conn->retries++;
> +				conn_flag(c, conn, SYN_RETRIED);
>  				tcp_timer_ctl(c, conn);
>  			}
>  		} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> diff --git a/tcp_conn.h b/tcp_conn.h
> index c006d56..87f4a2d 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -77,6 +77,7 @@ struct tcp_tap_conn {
>  #define ACK_TO_TAP_DUE		BIT(3)
>  #define ACK_FROM_TAP_DUE	BIT(4)
>  #define ACK_FROM_TAP_BLOCKS	BIT(5)
> +#define SYN_RETRIED		BIT(6)
>  
>  #define SNDBUF_BITS		24
>  	unsigned int	sndbuf		:SNDBUF_BITS;
> ---
> 
> ?

Ok, not as fiddly as I feared.  \o/

> > Additionally, if I'm reading the RFC correctly, it's treating this as
> > a one-time adjustment of the RTO, which won't necessarily remain the
> > case for the entire data phase.  Here this minimum will apply for the
> > entire data phase.
> 
> But it's the initial RTO (see Appendix A, which states it clearly), and
> any exponentiation is based on the initial value, so this should fit
> the requirement.
> 
> > Even though it's a "MUST" in the RFC, I kind of think we could just
> > skip this for two reasons:
> > 
> >   1) We already don't bother with RTT measurements, which the RFC
> >      assumes the implementation is doing to adjust the RTO.
> 
> Kind of:
> 
>    (2.1) Until a round-trip time (RTT) measurement has been made for a
>          segment sent between the sender and receiver, the sender SHOULD
>          set RTO <- 1 second
> 
> and given that this condition (no round-trip time measurement done yet)
> is explicitly considered, I guess we can reasonably expect TCP stacks we
> might be talking to to be fully compatible with what we're doing, as
> long as we stick to the RFC.

Good points, you convinced me.

> >   2) We expect to be talking to a guest.  The chance of a high RTT is
> >      vanishingly small compared to a path to potentially any host on
> >      the 2011 internet.
> 
> ...until we move the guest / container and we implement a TCP transport
> (somewhat overdue) in place of the existing tap / UNIX domain /
> vhost-user connection.

Right.  Or even right now, if the guest then NATs the connection onto
a slow VPN link.

> The chance is still small and the consequences of ignoring this part of
> the RFC are, I'm fairly sure, negligible, but if it's that easy to
> implement, should we really depart from it?
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson