On Mon, Nov 03, 2025 at 11:38:57AM +0100, Stefano Brivio wrote: > On Mon, 3 Nov 2025 12:37:52 +1100 > David Gibson wrote: > > > On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote: > > > Clamp the TCP retry timeout as Linux kernel does. If RTO is less > > > than 3 seconds, re-initialize it to 3 seconds for data retransmissions > > > according to RFC 6298. > > > > > > Suggested-by: Stefano Brivio > > > Signed-off-by: Yumei Huang > > > --- > > > tcp.c | 25 ++++++++++++++++++++----- > > > tcp.h | 2 ++ > > > 2 files changed, 22 insertions(+), 5 deletions(-) > > > > > > diff --git a/tcp.c b/tcp.c > > > index 96ee56a..84a6700 100644 > > > --- a/tcp.c > > > +++ b/tcp.c > > > @@ -187,6 +187,9 @@ > > > * for established connections, or (tcp_syn_retries + > > > * tcp_syn_linear_timeouts) times during the handshake, reset the connection > > > * > > > + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for > > > + * data retransmissions. > > > + * > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > > > * the connection > > > @@ -340,6 +343,7 @@ enum { > > > > > > #define ACK_INTERVAL 10 /* ms */ > > > #define RTO_INIT 1 /* s, RFC 6298 */ > > > +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ > > > #define FIN_TIMEOUT 60 > > > #define ACT_TIMEOUT 7200 > > > > > > @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX]; > > > > > > #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" > > > #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" > > > +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms" > > > > > > #define TCP_SYN_RETRIES_DEFAULT 6 > > > #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 > > > +#define TCP_RTO_MAX_MS_DEFAULT 120000 > > > > > > /* "Extended" data (not stored in the flow table) for TCP flow migration */ > > > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; > > > @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > > if (conn->flags & ACK_TO_TAP_DUE) { > > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > - int exp = conn->retries; > > > + int exp = conn->retries, timeout = RTO_INIT; > > > if (!(conn->events & ESTABLISHED)) > > > exp -= c->tcp.syn_linear_timeouts; > > > - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); > > > + else > > > + timeout = MAX(timeout, RTO_INIT_ACK); > > > > Possibly I missed something, since I only skimmed your discussion of > > this behaviour with Stefano. But I'm not convinced this is a correct > > interpretation of the RFC. (5.7) says "If the timer expires awaiting > > the ACK of a SYN segment ...". That is, I think it's only suggesting > > increasing the RTO to 3 at the data stage *if* we had at least one > > retry during the handshake. > > Oops, true, my bad. > > I guess I suggested the interpretation as of v7 because I just skimmed > Appendix A of RFC 6298 whose main function is to justify the reasons > behind lowering the initial timeout to one second, and I thought these > reason simply don't apply to the established phase, so we use three > seconds there. > > But that's clearly not the case, hence the "When this happens" clause > in the middle of Appendix A. > > > That is, unfortunately, much fiddlier to > > implement, since we need to remember what happened during the > > handshake to apply it here. > > Hmm, > > --- > diff --git a/tcp.c b/tcp.c > index c1eb5de..90c3ca1 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -397,7 +397,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = { > > static const char *tcp_flag_str[] __attribute((__unused__)) = { > "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", > - "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", > + "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", "SYN_RETRIED", > }; > > /* Listening sockets, used for automatic port forwarding in pasta mode only */ > @@ -597,7 +597,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn) > int exp = conn->retries, timeout = RTO_INIT; > if (!(conn->events & ESTABLISHED)) > exp -= c->tcp.syn_linear_timeouts; > - else > + else if (conn->flags & SYN_RETRIED) > timeout = MAX(timeout, RTO_INIT_ACK); > timeout <<= MAX(exp, 0); > it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); > @@ -2446,6 +2446,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) > flow_trace(conn, "SYN timeout, retry"); > tcp_send_flag(c, conn, SYN); > conn->retries++; > + conn_flag(c, conn, SYN_RETRIED); > tcp_timer_ctl(c, conn); > } > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > diff --git a/tcp_conn.h b/tcp_conn.h > index c006d56..87f4a2d 100644 > --- a/tcp_conn.h > +++ b/tcp_conn.h > @@ -77,6 +77,7 @@ struct tcp_tap_conn { > #define ACK_TO_TAP_DUE BIT(3) > #define ACK_FROM_TAP_DUE BIT(4) > #define ACK_FROM_TAP_BLOCKS BIT(5) > +#define SYN_RETRIED BIT(6) > > #define SNDBUF_BITS 24 > unsigned int sndbuf :SNDBUF_BITS; > --- > > ? Ok, not as fiddly as I feared. \o/ > > Additionally, if I'm reading the RFC correctly, it's treating this as > > a one-time adjustment of the RTO, which won't necessarily remain the > > case for the entire data phase. Here this minimum will apply for the > > entire data phase. > > But it's the initial RTO (see Appendix A, which states it clearly), and > any exponentiation is based on the initial value, so this should fit > the requirement. > > > Even though it's a "MUST" in the RFC, I kind of think we could just > > skip this for two reasons: > > > > 1) We already don't bother with RTT measurements, which the RFC > > assumes the implementation is doing to adjust the RTO. > > Kind of: > > (2.1) Until a round-trip time (RTT) measurement has been made for a > segment sent between the sender and receiver, the sender SHOULD > set RTO <- 1 second > > and given that this condition (no round-trip time measurement done yet) > is explicitly considered, I guess we can reasonably expect TCP stacks we > might be talking to to be fully compatible with what we're doing, as > long as we stick to the RFC. Good points, you convinced me. > > 2) We expect to be talking to a guest. The chance of a high RTT is > > vanishingly small compared to a path to potentially any host on > > the 2011 internet. > > ...until we move the guest / container and we implement a TCP transport > (somewhat overdue) in place of the existing tap / UNIX domain / > vhost-user connection. Right. Or even right now, if the guest then NATs the connection onto a slow VPN link. > The chance is still small and the consequences of ignoring this part of > the RFC are, I'm fairly sure, negligible, but if it's that easy to > implement, should we really depart from it? > > -- > Stefano > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson