* [PATCH v7 0/5] Retry SYNs for inbound connections
@ 2025-10-31 5:42 Yumei Huang
2025-10-31 5:42 ` [PATCH v7 1/5] tcp: Rename "retrans" to "retries" Yumei Huang
` (4 more replies)
0 siblings, 5 replies; 34+ messages in thread
From: Yumei Huang @ 2025-10-31 5:42 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
When a client connects, SYN would be sent to guest only once. If the
guest is not connected or ready at that time, the connection will be
reset in 10s. These patches introduce the SYN retry mechanism using
the similar backoff timeout as linux kernel. Also update the data
retransmission timeout using the backoff timeout.
v7:
- Update read_file() and read_file_integer()
- Rename tcp_syn_params_init() to tcp_get_rto_params()
- Modify the implementation of the timeout assignment
- Add a patch to clamp the retry timeout
Yumei Huang (5):
tcp: Rename "retrans" to "retries"
util: Introduce read_file() and read_file_integer() function
tcp: Resend SYN for inbound connections
tcp: Update data retransmission timeout
tcp: Clamp the retry timeout
tcp.c | 86 ++++++++++++++++++++++++++++++++++++++++++------------
tcp.h | 6 ++++
tcp_conn.h | 12 ++++----
util.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
util.h | 2 ++
5 files changed, 168 insertions(+), 24 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v7 1/5] tcp: Rename "retrans" to "retries"
2025-10-31 5:42 [PATCH v7 0/5] Retry SYNs for inbound connections Yumei Huang
@ 2025-10-31 5:42 ` Yumei Huang
2025-10-31 5:42 ` [PATCH v7 2/5] util: Introduce read_file() and read_file_integer() function Yumei Huang
` (3 subsequent siblings)
4 siblings, 0 replies; 34+ messages in thread
From: Yumei Huang @ 2025-10-31 5:42 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
Rename "retrans" to "retries" so it can be used for SYN retries.
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 12 ++++++------
tcp_conn.h | 12 ++++++------
2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/tcp.c b/tcp.c
index 0f9e9b3..2ec4b0c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -186,7 +186,7 @@
* - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
* data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
* socket and reset sequence to what was acknowledged. If this persists for
- * more than TCP_MAX_RETRANS times in a row, reset the connection
+ * more than TCP_MAX_RETRIES times in a row, reset the connection
*
* - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
* with TAP_FIN_SENT event), and no ACK is received within this time, reset
@@ -1127,7 +1127,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
if (SEQ_LT(seq, conn->seq_to_tap))
conn_flag(c, conn, ACK_FROM_TAP_DUE);
- conn->retrans = 0;
+ conn->retries = 0;
conn->seq_ack_from_tap = seq;
}
}
@@ -2414,7 +2414,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
flow_dbg(conn, "FIN timeout");
tcp_rst(c, conn);
- } else if (conn->retrans == TCP_MAX_RETRANS) {
+ } else if (conn->retries == TCP_MAX_RETRIES) {
flow_dbg(conn, "retransmissions count exceeded");
tcp_rst(c, conn);
} else {
@@ -2423,7 +2423,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
- conn->retrans++;
+ conn->retries++;
if (tcp_rewind_seq(c, conn))
return;
@@ -3382,7 +3382,7 @@ static int tcp_flow_repair_opt(const struct tcp_tap_conn *conn,
int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
{
struct tcp_tap_transfer t = {
- .retrans = conn->retrans,
+ .retries = conn->retries,
.ws_from_tap = conn->ws_from_tap,
.ws_to_tap = conn->ws_to_tap,
.events = conn->events,
@@ -3662,7 +3662,7 @@ int tcp_flow_migrate_target(struct ctx *c, int fd)
memcpy(&flow->f.side, &t.side, sizeof(flow->f.side));
conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
- conn->retrans = t.retrans;
+ conn->retries = t.retries;
conn->ws_from_tap = t.ws_from_tap;
conn->ws_to_tap = t.ws_to_tap;
conn->events = t.events;
diff --git a/tcp_conn.h b/tcp_conn.h
index 38b5c54..e5c8146 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -13,7 +13,7 @@
* struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
* @f: Generic flow information
* @in_epoll: Is the connection in the epoll set?
- * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT
+ * @retries: Number of retries occurred due to timeouts
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
@@ -38,9 +38,9 @@ struct tcp_tap_conn {
bool in_epoll :1;
-#define TCP_RETRANS_BITS 3
- unsigned int retrans :TCP_RETRANS_BITS;
-#define TCP_MAX_RETRANS MAX_FROM_BITS(TCP_RETRANS_BITS)
+#define TCP_RETRIES_BITS 3
+ unsigned int retries :TCP_RETRIES_BITS;
+#define TCP_MAX_RETRIES MAX_FROM_BITS(TCP_RETRIES_BITS)
#define TCP_WS_BITS 4 /* RFC 7323 */
#define TCP_WS_MAX 14
@@ -102,7 +102,7 @@ struct tcp_tap_conn {
* struct tcp_tap_transfer - Migrated TCP data, flow table part, network order
* @pif: Interfaces for each side of the flow
* @side: Addresses and ports for each side of the flow
- * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT
+ * @retries: Number of retries occurred due to timeouts
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @events: Connection events, implying connection states
@@ -122,7 +122,7 @@ struct tcp_tap_transfer {
uint8_t pif[SIDES];
struct flowside side[SIDES];
- uint8_t retrans;
+ uint8_t retries;
uint8_t ws_from_tap;
uint8_t ws_to_tap;
uint8_t events;
--
2.49.0
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v7 2/5] util: Introduce read_file() and read_file_integer() function
2025-10-31 5:42 [PATCH v7 0/5] Retry SYNs for inbound connections Yumei Huang
2025-10-31 5:42 ` [PATCH v7 1/5] tcp: Rename "retrans" to "retries" Yumei Huang
@ 2025-10-31 5:42 ` Yumei Huang
2025-11-03 0:53 ` David Gibson
2025-10-31 5:42 ` [PATCH v7 3/5] tcp: Resend SYN for inbound connections Yumei Huang
` (2 subsequent siblings)
4 siblings, 1 reply; 34+ messages in thread
From: Yumei Huang @ 2025-10-31 5:42 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
---
util.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
util.h | 2 ++
2 files changed, 88 insertions(+)
diff --git a/util.c b/util.c
index c492f90..b83de0d 100644
--- a/util.c
+++ b/util.c
@@ -579,6 +579,92 @@ int write_file(const char *path, const char *buf)
return len == 0 ? 0 : -1;
}
+/**
+ * read_file() - Read contents of file into a NULL-terminated buffer
+ * @path: Path to file to read
+ * @buf: Buffer to store file contents
+ * @buf_size: Size of buffer
+ *
+ * Return: number of bytes read on success, -1 on error, -ENOBUFS on truncation
+*/
+ssize_t read_file(const char *path, char *buf, size_t buf_size)
+{
+ int fd = open(path, O_RDONLY | O_CLOEXEC);
+ size_t total_read = 0;
+ ssize_t rc;
+
+ if (fd < 0) {
+ warn_perror("Could not open %s", path);
+ return -1;
+ }
+
+ while (total_read < buf_size) {
+ rc = read(fd, buf + total_read, buf_size - total_read);
+
+ if (rc < 0) {
+ warn_perror("Couldn't read from %s", path);
+ close(fd);
+ return -1;
+ }
+
+ if (rc == 0)
+ break;
+
+ total_read += rc;
+ }
+
+ close(fd);
+
+ if (total_read == buf_size) {
+ warn("File %s contents exceed buffer size %zu", path,
+ buf_size);
+ buf[buf_size - 1] = '\0';
+ return -ENOBUFS;
+ }
+
+ buf[total_read] = '\0';
+
+ return total_read;
+}
+
+/**
+ * read_file_integer() - Read an integer value from a file
+ * @path: Path to file to read
+ * @fallback: Default value if file can't be read
+ *
+ * Return: integer value, @fallback on failure
+*/
+intmax_t read_file_integer(const char *path, intmax_t fallback)
+{
+ ssize_t bytes_read;
+ char buf[BUFSIZ];
+ intmax_t value;
+ char *end;
+
+ bytes_read = read_file(path, buf, sizeof(buf));
+
+ if (bytes_read < 0)
+ return fallback;
+
+ if (bytes_read == 0) {
+ debug("Empty file %s", path);
+ return fallback;
+ }
+
+ errno = 0;
+ value = strtoimax(buf, &end, 10);
+ if (*end && *end != '\n') {
+ debug("Non-numeric content in %s", path);
+ return fallback;
+ }
+ if (errno) {
+ debug("Out of range value in %s: %s", path, buf);
+ return fallback;
+ }
+
+ return value;
+}
+
#ifdef __ia64__
/* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
* use the description from clone(2).
diff --git a/util.h b/util.h
index 22eaac5..595a17d 100644
--- a/util.h
+++ b/util.h
@@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
int __daemon(int pidfile_fd, int devnull_fd);
int fls(unsigned long x);
int write_file(const char *path, const char *buf);
+ssize_t read_file(const char *path, char *buf, size_t buf_size);
+intmax_t read_file_integer(const char *path, intmax_t fallback);
int write_all_buf(int fd, const void *buf, size_t len);
int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
int read_all_buf(int fd, void *buf, size_t len);
--
2.49.0
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-10-31 5:42 [PATCH v7 0/5] Retry SYNs for inbound connections Yumei Huang
2025-10-31 5:42 ` [PATCH v7 1/5] tcp: Rename "retrans" to "retries" Yumei Huang
2025-10-31 5:42 ` [PATCH v7 2/5] util: Introduce read_file() and read_file_integer() function Yumei Huang
@ 2025-10-31 5:42 ` Yumei Huang
2025-11-03 1:09 ` David Gibson
2025-11-04 4:42 ` Stefano Brivio
2025-10-31 5:42 ` [PATCH v7 4/5] tcp: Update data retransmission timeout Yumei Huang
2025-10-31 5:42 ` [PATCH v7 5/5] tcp: Clamp the retry timeout Yumei Huang
4 siblings, 2 replies; 34+ messages in thread
From: Yumei Huang @ 2025-10-31 5:42 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
If a client connects while guest is not connected or ready yet,
resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as Linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
---
tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--------
tcp.h | 4 ++++
2 files changed, 53 insertions(+), 8 deletions(-)
diff --git a/tcp.c b/tcp.c
index 2ec4b0c..bada88a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -179,9 +179,11 @@
*
* Timeouts are implemented by means of timerfd timers, set based on flags:
*
- * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
- * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
- * connection
+ * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
+ * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
+ * SYN. It's the starting timeout for the first SYN retry. Retry for
+ * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
+ * reset the connection
*
* - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
* data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
@@ -340,7 +342,7 @@ enum {
#define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */
-#define SYN_TIMEOUT 10 /* s */
+#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
#define ACK_TIMEOUT 2
#define FIN_TIMEOUT 60
#define ACT_TIMEOUT 7200
@@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
+#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
+
+#define TCP_SYN_RETRIES_DEFAULT 6
+#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
+
/* "Extended" data (not stored in the flow table) for TCP flow migration */
static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
if (conn->flags & ACK_TO_TAP_DUE) {
it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
} else if (conn->flags & ACK_FROM_TAP_DUE) {
- if (!(conn->events & ESTABLISHED))
- it.it_value.tv_sec = SYN_TIMEOUT;
+ if (!(conn->events & ESTABLISHED)) {
+ int exp = conn->retries - c->tcp.syn_linear_timeouts;
+ it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
+ }
else
it.it_value.tv_sec = ACK_TIMEOUT;
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
@@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
tcp_timer_ctl(c, conn);
} else if (conn->flags & ACK_FROM_TAP_DUE) {
if (!(conn->events & ESTABLISHED)) {
- flow_dbg(conn, "handshake timeout");
- tcp_rst(c, conn);
+ if (conn->retries >= TCP_MAX_RETRIES ||
+ conn->retries >= (c->tcp.tcp_syn_retries +
+ c->tcp.syn_linear_timeouts)) {
+ flow_dbg(conn, "handshake timeout");
+ tcp_rst(c, conn);
+ } else {
+ flow_trace(conn, "SYN timeout, retry");
+ tcp_send_flag(c, conn, SYN);
+ conn->retries++;
+ tcp_timer_ctl(c, conn);
+ }
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
flow_dbg(conn, "FIN timeout");
tcp_rst(c, conn);
@@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void)
return sl;
}
+/**
+ * tcp_get_rto_params() - Get host kernel RTO parameters
+ * @c: Execution context
+*/
+void tcp_get_rto_params(struct ctx *c)
+{
+ intmax_t tcp_syn_retries, syn_linear_timeouts;
+
+ tcp_syn_retries = read_file_integer(
+ TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
+ syn_linear_timeouts = read_file_integer(
+ TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
+
+ c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
+ c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
+
+ debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
+}
+
/**
* tcp_init() - Get initial sequence, hash secret, initialise per-socket data
* @c: Execution context
@@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c)
{
ASSERT(!c->no_tcp);
+ tcp_get_rto_params(c);
+
tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
diff --git a/tcp.h b/tcp.h
index 234a803..befedde 100644
--- a/tcp.h
+++ b/tcp.h
@@ -59,12 +59,16 @@ union tcp_listen_epoll_ref {
* @fwd_out: Port forwarding configuration for outbound packets
* @timer_run: Timestamp of most recent timer run
* @pipe_size: Size of pipes for spliced connections
+ * @tcp_syn_retries: SYN retries using exponential backoff timeout
+ * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
*/
struct tcp_ctx {
struct fwd_ports fwd_in;
struct fwd_ports fwd_out;
struct timespec timer_run;
size_t pipe_size;
+ uint8_t tcp_syn_retries;
+ uint8_t syn_linear_timeouts;
};
#endif /* TCP_H */
--
2.49.0
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v7 4/5] tcp: Update data retransmission timeout
2025-10-31 5:42 [PATCH v7 0/5] Retry SYNs for inbound connections Yumei Huang
` (2 preceding siblings ...)
2025-10-31 5:42 ` [PATCH v7 3/5] tcp: Resend SYN for inbound connections Yumei Huang
@ 2025-10-31 5:42 ` Yumei Huang
2025-10-31 5:56 ` Yumei Huang
2025-11-03 1:18 ` David Gibson
2025-10-31 5:42 ` [PATCH v7 5/5] tcp: Clamp the retry timeout Yumei Huang
4 siblings, 2 replies; 34+ messages in thread
From: Yumei Huang @ 2025-10-31 5:42 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
Use an exponential backoff timeout for data retransmission according
to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 30 ++++++++++++------------------
1 file changed, 12 insertions(+), 18 deletions(-)
diff --git a/tcp.c b/tcp.c
index bada88a..96ee56a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -179,16 +179,13 @@
*
* Timeouts are implemented by means of timerfd timers, set based on flags:
*
- * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
- * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
- * SYN. It's the starting timeout for the first SYN retry. Retry for
- * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
- * reset the connection
- *
- * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
- * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
- * socket and reset sequence to what was acknowledged. If this persists for
- * more than TCP_MAX_RETRIES times in a row, reset the connection
+ * - RTO_INIT: if no ACK segment was received from tap/guest, either during
+ * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
+ * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
+ * from the socket and reset sequence to what was acknowledged. This is the
+ * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times
+ * for established connections, or (tcp_syn_retries +
+ * tcp_syn_linear_timeouts) times during the handshake, reset the connection
*
* - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
* with TAP_FIN_SENT event), and no ACK is received within this time, reset
@@ -342,8 +339,7 @@ enum {
#define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */
-#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
-#define ACK_TIMEOUT 2
+#define RTO_INIT 1 /* s, RFC 6298 */
#define FIN_TIMEOUT 60
#define ACT_TIMEOUT 7200
@@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
if (conn->flags & ACK_TO_TAP_DUE) {
it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
} else if (conn->flags & ACK_FROM_TAP_DUE) {
- if (!(conn->events & ESTABLISHED)) {
- int exp = conn->retries - c->tcp.syn_linear_timeouts;
- it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
- }
- else
- it.it_value.tv_sec = ACK_TIMEOUT;
+ int exp = conn->retries;
+ if (!(conn->events & ESTABLISHED))
+ exp -= c->tcp.syn_linear_timeouts;
+ it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
it.it_value.tv_sec = FIN_TIMEOUT;
} else {
--
2.49.0
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-10-31 5:42 [PATCH v7 0/5] Retry SYNs for inbound connections Yumei Huang
` (3 preceding siblings ...)
2025-10-31 5:42 ` [PATCH v7 4/5] tcp: Update data retransmission timeout Yumei Huang
@ 2025-10-31 5:42 ` Yumei Huang
2025-10-31 8:38 ` Stefano Brivio
` (2 more replies)
4 siblings, 3 replies; 34+ messages in thread
From: Yumei Huang @ 2025-10-31 5:42 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
Clamp the TCP retry timeout as Linux kernel does. If RTO is less
than 3 seconds, re-initialize it to 3 seconds for data retransmissions
according to RFC 6298.
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
---
tcp.c | 25 ++++++++++++++++++++-----
tcp.h | 2 ++
2 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/tcp.c b/tcp.c
index 96ee56a..84a6700 100644
--- a/tcp.c
+++ b/tcp.c
@@ -187,6 +187,9 @@
* for established connections, or (tcp_syn_retries +
* tcp_syn_linear_timeouts) times during the handshake, reset the connection
*
+ * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
+ * data retransmissions.
+ *
* - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
* with TAP_FIN_SENT event), and no ACK is received within this time, reset
* the connection
@@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */
#define RTO_INIT 1 /* s, RFC 6298 */
+#define RTO_INIT_ACK 3 /* s, RFC 6298 */
#define FIN_TIMEOUT 60
#define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
+#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
#define TCP_SYN_RETRIES_DEFAULT 6
#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
+#define TCP_RTO_MAX_MS_DEFAULT 120000
/* "Extended" data (not stored in the flow table) for TCP flow migration */
static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
if (conn->flags & ACK_TO_TAP_DUE) {
it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
} else if (conn->flags & ACK_FROM_TAP_DUE) {
- int exp = conn->retries;
+ int exp = conn->retries, timeout = RTO_INIT;
if (!(conn->events & ESTABLISHED))
exp -= c->tcp.syn_linear_timeouts;
- it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
+ else
+ timeout = MAX(timeout, RTO_INIT_ACK);
+ timeout <<= MAX(exp, 0);
+ it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
it.it_value.tv_sec = FIN_TIMEOUT;
} else {
@@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void)
*/
void tcp_get_rto_params(struct ctx *c)
{
- intmax_t tcp_syn_retries, syn_linear_timeouts;
+ intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
tcp_syn_retries = read_file_integer(
TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
syn_linear_timeouts = read_file_integer(
TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
+ tcp_rto_max_ms = read_file_integer(
+ TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
+ c->tcp.tcp_rto_max = MIN(
+ DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
- debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
- c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
+ debug("Read sysctl values tcp_syn_retries: %"PRIu8
+ ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu",
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts,
+ c->tcp.tcp_rto_max);
}
/**
diff --git a/tcp.h b/tcp.h
index befedde..a238bb7 100644
--- a/tcp.h
+++ b/tcp.h
@@ -59,6 +59,7 @@ union tcp_listen_epoll_ref {
* @fwd_out: Port forwarding configuration for outbound packets
* @timer_run: Timestamp of most recent timer run
* @pipe_size: Size of pipes for spliced connections
+ * @tcp_rto_max: Maximal retry timeout (in s)
* @tcp_syn_retries: SYN retries using exponential backoff timeout
* @syn_linear_timeouts: SYN retries before using exponential backoff timeout
*/
@@ -67,6 +68,7 @@ struct tcp_ctx {
struct fwd_ports fwd_out;
struct timespec timer_run;
size_t pipe_size;
+ size_t tcp_rto_max;
uint8_t tcp_syn_retries;
uint8_t syn_linear_timeouts;
};
--
2.49.0
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 4/5] tcp: Update data retransmission timeout
2025-10-31 5:42 ` [PATCH v7 4/5] tcp: Update data retransmission timeout Yumei Huang
@ 2025-10-31 5:56 ` Yumei Huang
2025-10-31 8:04 ` Stefano Brivio
2025-11-03 1:18 ` David Gibson
1 sibling, 1 reply; 34+ messages in thread
From: Yumei Huang @ 2025-10-31 5:56 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david
On Fri, Oct 31, 2025 at 1:43 PM Yumei Huang <yuhuang@redhat.com> wrote:
>
> Use an exponential backoff timeout for data retransmission according
> to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> in Appendix A of RFC 6298.
>
> Also combine the macros defining the initial RTO for both SYN and ACK.
>
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I probably should have removed the reviewed-by since there are some
code changes.
> ---
> tcp.c | 30 ++++++++++++------------------
> 1 file changed, 12 insertions(+), 18 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index bada88a..96ee56a 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -179,16 +179,13 @@
> *
> * Timeouts are implemented by means of timerfd timers, set based on flags:
> *
> - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> - * SYN. It's the starting timeout for the first SYN retry. Retry for
> - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> - * reset the connection
> - *
> - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> - * socket and reset sequence to what was acknowledged. If this persists for
> - * more than TCP_MAX_RETRIES times in a row, reset the connection
> + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> + * from the socket and reset sequence to what was acknowledged. This is the
> + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times
> + * for established connections, or (tcp_syn_retries +
> + * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> *
> * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> @@ -342,8 +339,7 @@ enum {
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> #define ACK_INTERVAL 10 /* ms */
> -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> -#define ACK_TIMEOUT 2
> +#define RTO_INIT 1 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
>
> @@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - if (!(conn->events & ESTABLISHED)) {
> - int exp = conn->retries - c->tcp.syn_linear_timeouts;
> - it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> - }
> - else
> - it.it_value.tv_sec = ACK_TIMEOUT;
> + int exp = conn->retries;
> + if (!(conn->events & ESTABLISHED))
> + exp -= c->tcp.syn_linear_timeouts;
> + it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> it.it_value.tv_sec = FIN_TIMEOUT;
> } else {
> --
> 2.49.0
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 4/5] tcp: Update data retransmission timeout
2025-10-31 5:56 ` Yumei Huang
@ 2025-10-31 8:04 ` Stefano Brivio
0 siblings, 0 replies; 34+ messages in thread
From: Stefano Brivio @ 2025-10-31 8:04 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Fri, 31 Oct 2025 13:56:41 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Fri, Oct 31, 2025 at 1:43 PM Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > Use an exponential backoff timeout for data retransmission according
> > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > in Appendix A of RFC 6298.
> >
> > Also combine the macros defining the initial RTO for both SYN and ACK.
> >
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> I probably should have removed the reviewed-by since there are some
> code changes.
Thanks for reporting.
I eventually notice during review and drop tags as needed but if you
notice you carried a tag by mistake, it's very appreciated that you
report it as it saves me some time and probably avoid confusion for
other reviewers.
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-10-31 5:42 ` [PATCH v7 5/5] tcp: Clamp the retry timeout Yumei Huang
@ 2025-10-31 8:38 ` Stefano Brivio
2025-11-03 3:11 ` Yumei Huang
2025-11-03 1:37 ` David Gibson
2025-11-04 4:42 ` Stefano Brivio
2 siblings, 1 reply; 34+ messages in thread
From: Stefano Brivio @ 2025-10-31 8:38 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Fri, 31 Oct 2025 13:42:42 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> according to RFC 6298.
>
> Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> ---
> tcp.c | 25 ++++++++++++++++++++-----
> tcp.h | 2 ++
> 2 files changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 96ee56a..84a6700 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -187,6 +187,9 @@
> * for established connections, or (tcp_syn_retries +
> * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> *
> + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
Nit: we started with BrE (British English) spelling and later tried to
keep that consistent for the sake of grep, so we'll usually use
"initialise" (for consistency, at this point).
> + * data retransmissions.
> + *
> * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> * the connection
> @@ -340,6 +343,7 @@ enum {
>
> #define ACK_INTERVAL 10 /* ms */
> #define RTO_INIT 1 /* s, RFC 6298 */
> +#define RTO_INIT_ACK 3 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
>
> @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
>
> #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
>
> #define TCP_SYN_RETRIES_DEFAULT 6
> #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> +#define TCP_RTO_MAX_MS_DEFAULT 120000
>
> /* "Extended" data (not stored in the flow table) for TCP flow migration */
> static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - int exp = conn->retries;
> + int exp = conn->retries, timeout = RTO_INIT;
> if (!(conn->events & ESTABLISHED))
> exp -= c->tcp.syn_linear_timeouts;
> - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> + else
> + timeout = MAX(timeout, RTO_INIT_ACK);
> + timeout <<= MAX(exp, 0);
> + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> it.it_value.tv_sec = FIN_TIMEOUT;
> } else {
> @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void)
> */
> void tcp_get_rto_params(struct ctx *c)
> {
> - intmax_t tcp_syn_retries, syn_linear_timeouts;
> + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
>
> tcp_syn_retries = read_file_integer(
> TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> syn_linear_timeouts = read_file_integer(
> TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> + tcp_rto_max_ms = read_file_integer(
> + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
>
> c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> + c->tcp.tcp_rto_max = MIN(
> + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
size_t looks like a rather weird choice for tcp_rto_max: size_t is used
to represent the size of objects, and nothing else (not a timeout in
milliseconds).
It's also an int in ipv4_net_table[], net/ipv4/sysctl_net_ipv4.c
(Linux kernel). Any reason for picking size_t here?
I mean, it will work, it just looks weird (and wastes a tiny bit of space,
even though I guess we don't care about 4 bytes there).
> - debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> + debug("Read sysctl values tcp_syn_retries: %"PRIu8
> + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu",
> + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts,
> + c->tcp.tcp_rto_max);
> }
>
> /**
> diff --git a/tcp.h b/tcp.h
> index befedde..a238bb7 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref {
> * @fwd_out: Port forwarding configuration for outbound packets
> * @timer_run: Timestamp of most recent timer run
> * @pipe_size: Size of pipes for spliced connections
> + * @tcp_rto_max: Maximal retry timeout (in s)
Nit: "maximal" has a slightly different meaning compared to "maximum".
The highest value allowed for a field would typically be called
"maximum", while "maximal" is more commonly used to indicate a value /
element that's the biggest of all values. Yes, I know, it's complicated.
> * @tcp_syn_retries: SYN retries using exponential backoff timeout
> * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> */
> @@ -67,6 +68,7 @@ struct tcp_ctx {
> struct fwd_ports fwd_out;
> struct timespec timer_run;
> size_t pipe_size;
> + size_t tcp_rto_max;
> uint8_t tcp_syn_retries;
> uint8_t syn_linear_timeouts;
> };
The rest of the series looks good to me at a *very* quick glance, but I
can't claim I really reviewed it yet.
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 2/5] util: Introduce read_file() and read_file_integer() function
2025-10-31 5:42 ` [PATCH v7 2/5] util: Introduce read_file() and read_file_integer() function Yumei Huang
@ 2025-11-03 0:53 ` David Gibson
0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-11-03 0:53 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 3559 bytes --]
On Fri, Oct 31, 2025 at 01:42:39PM +0800, Yumei Huang wrote:
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> util.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> util.h | 2 ++
> 2 files changed, 88 insertions(+)
>
> diff --git a/util.c b/util.c
> index c492f90..b83de0d 100644
> --- a/util.c
> +++ b/util.c
> @@ -579,6 +579,92 @@ int write_file(const char *path, const char *buf)
> return len == 0 ? 0 : -1;
> }
>
> +/**
> + * read_file() - Read contents of file into a NULL-terminated buffer
> + * @path: Path to file to read
> + * @buf: Buffer to store file contents
> + * @buf_size: Size of buffer
> + *
> + * Return: number of bytes read on success, -1 on error, -ENOBUFS on truncation
> +*/
> +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> +{
> + int fd = open(path, O_RDONLY | O_CLOEXEC);
> + size_t total_read = 0;
> + ssize_t rc;
> +
> + if (fd < 0) {
> + warn_perror("Could not open %s", path);
> + return -1;
> + }
> +
> + while (total_read < buf_size) {
> + rc = read(fd, buf + total_read, buf_size - total_read);
> +
> + if (rc < 0) {
> + warn_perror("Couldn't read from %s", path);
> + close(fd);
> + return -1;
> + }
> +
> + if (rc == 0)
> + break;
> +
> + total_read += rc;
> + }
> +
> + close(fd);
> +
> + if (total_read == buf_size) {
> + warn("File %s contents exceed buffer size %zu", path,
> + buf_size);
> + buf[buf_size - 1] = '\0';
> + return -ENOBUFS;
> + }
> +
> + buf[total_read] = '\0';
> +
> + return total_read;
> +}
> +
> +/**
> + * read_file_integer() - Read an integer value from a file
> + * @path: Path to file to read
> + * @fallback: Default value if file can't be read
> + *
> + * Return: integer value, @fallback on failure
> +*/
> +intmax_t read_file_integer(const char *path, intmax_t fallback)
> +{
> + ssize_t bytes_read;
> + char buf[BUFSIZ];
> + intmax_t value;
> + char *end;
> +
> + bytes_read = read_file(path, buf, sizeof(buf));
> +
> + if (bytes_read < 0)
> + return fallback;
> +
> + if (bytes_read == 0) {
> + debug("Empty file %s", path);
> + return fallback;
> + }
> +
> + errno = 0;
> + value = strtoimax(buf, &end, 10);
> + if (*end && *end != '\n') {
> + debug("Non-numeric content in %s", path);
> + return fallback;
> + }
> + if (errno) {
> + debug("Out of range value in %s: %s", path, buf);
> + return fallback;
> + }
> +
> + return value;
> +}
> +
> #ifdef __ia64__
> /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
> * use the description from clone(2).
> diff --git a/util.h b/util.h
> index 22eaac5..595a17d 100644
> --- a/util.h
> +++ b/util.h
> @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
> int __daemon(int pidfile_fd, int devnull_fd);
> int fls(unsigned long x);
> int write_file(const char *path, const char *buf);
> +ssize_t read_file(const char *path, char *buf, size_t buf_size);
> +intmax_t read_file_integer(const char *path, intmax_t fallback);
> int write_all_buf(int fd, const void *buf, size_t len);
> int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> int read_all_buf(int fd, void *buf, size_t len);
> --
> 2.49.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-10-31 5:42 ` [PATCH v7 3/5] tcp: Resend SYN for inbound connections Yumei Huang
@ 2025-11-03 1:09 ` David Gibson
2025-11-03 2:31 ` Yumei Huang
2025-11-04 4:42 ` Stefano Brivio
1 sibling, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-11-03 1:09 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 6458 bytes --]
On Fri, Oct 31, 2025 at 01:42:40PM +0800, Yumei Huang wrote:
> If a client connects while guest is not connected or ready yet,
> resend SYN instead of just resetting connection after 10 seconds.
>
> Use the same backoff calculation for the timeout as Linux kernel.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=153
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
There are a couple of tiny style nits and a mostly theoretical bug
noted below.
> ---
> tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> tcp.h | 4 ++++
> 2 files changed, 53 insertions(+), 8 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 2ec4b0c..bada88a 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -179,9 +179,11 @@
> *
> * Timeouts are implemented by means of timerfd timers, set based on flags:
> *
> - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> - * connection
> + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
Nit: I'd maybe say SYN,ACK here to distinguish it from other ACKs.
> + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> + * SYN. It's the starting timeout for the first SYN retry. Retry for
> + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> + * reset the connection
> *
> * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> @@ -340,7 +342,7 @@ enum {
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> #define ACK_INTERVAL 10 /* ms */
> -#define SYN_TIMEOUT 10 /* s */
> +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> #define ACK_TIMEOUT 2
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
> @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
>
> #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
>
> +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> +
> +#define TCP_SYN_RETRIES_DEFAULT 6
> +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> +
> /* "Extended" data (not stored in the flow table) for TCP flow migration */
> static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
>
> @@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - if (!(conn->events & ESTABLISHED))
> - it.it_value.tv_sec = SYN_TIMEOUT;
> + if (!(conn->events & ESTABLISHED)) {
> + int exp = conn->retries - c->tcp.syn_linear_timeouts;
> + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> + }
> else
A non-obvious detail of the style we use is that if one branch of an
if uses { }, then the other branch should as well, even if it's one
line. So above should be "} else {" and the bit below changed to
match.
> it.it_value.tv_sec = ACK_TIMEOUT;
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> tcp_timer_ctl(c, conn);
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> if (!(conn->events & ESTABLISHED)) {
> - flow_dbg(conn, "handshake timeout");
> - tcp_rst(c, conn);
> + if (conn->retries >= TCP_MAX_RETRIES ||
> + conn->retries >= (c->tcp.tcp_syn_retries +
> + c->tcp.syn_linear_timeouts)) {
Here's the theoretical bug. We only clamp tcp_syn_retries and
syn_linear_timeouts to a uint8_t, so if the host has these set to
strangely large values, this addition could overflow.
> + flow_dbg(conn, "handshake timeout");
> + tcp_rst(c, conn);
> + } else {
> + flow_trace(conn, "SYN timeout, retry");
> + tcp_send_flag(c, conn, SYN);
> + conn->retries++;
> + tcp_timer_ctl(c, conn);
> + }
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> flow_dbg(conn, "FIN timeout");
> tcp_rst(c, conn);
> @@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void)
> return sl;
> }
>
> +/**
> + * tcp_get_rto_params() - Get host kernel RTO parameters
> + * @c: Execution context
> +*/
> +void tcp_get_rto_params(struct ctx *c)
> +{
> + intmax_t tcp_syn_retries, syn_linear_timeouts;
> +
> + tcp_syn_retries = read_file_integer(
> + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> + syn_linear_timeouts = read_file_integer(
> + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> +
> + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> +
> + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> +}
> +
> /**
> * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> * @c: Execution context
> @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c)
> {
> ASSERT(!c->no_tcp);
>
> + tcp_get_rto_params(c);
> +
> tcp_sock_iov_init(c);
>
> memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> diff --git a/tcp.h b/tcp.h
> index 234a803..befedde 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref {
> * @fwd_out: Port forwarding configuration for outbound packets
> * @timer_run: Timestamp of most recent timer run
> * @pipe_size: Size of pipes for spliced connections
> + * @tcp_syn_retries: SYN retries using exponential backoff timeout
> + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> */
> struct tcp_ctx {
> struct fwd_ports fwd_in;
> struct fwd_ports fwd_out;
> struct timespec timer_run;
> size_t pipe_size;
> + uint8_t tcp_syn_retries;
> + uint8_t syn_linear_timeouts;
> };
>
> #endif /* TCP_H */
> --
> 2.49.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 4/5] tcp: Update data retransmission timeout
2025-10-31 5:42 ` [PATCH v7 4/5] tcp: Update data retransmission timeout Yumei Huang
2025-10-31 5:56 ` Yumei Huang
@ 2025-11-03 1:18 ` David Gibson
2025-11-03 2:57 ` Yumei Huang
1 sibling, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-11-03 1:18 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 4202 bytes --]
On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
> Use an exponential backoff timeout for data retransmission according
> to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> in Appendix A of RFC 6298.
>
> Also combine the macros defining the initial RTO for both SYN and ACK.
>
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As reported, the carried over R-b was a minor mistake, since the code
has changed, but here's a new one:
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Small comment below, though.
> ---
> tcp.c | 30 ++++++++++++------------------
> 1 file changed, 12 insertions(+), 18 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index bada88a..96ee56a 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -179,16 +179,13 @@
> *
> * Timeouts are implemented by means of timerfd timers, set based on flags:
> *
> - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> - * SYN. It's the starting timeout for the first SYN retry. Retry for
> - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> - * reset the connection
> - *
> - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> - * socket and reset sequence to what was acknowledged. If this persists for
> - * more than TCP_MAX_RETRIES times in a row, reset the connection
> + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> + * from the socket and reset sequence to what was acknowledged. This is the
> + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times
> + * for established connections, or (tcp_syn_retries +
> + * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> *
> * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> @@ -342,8 +339,7 @@ enum {
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> #define ACK_INTERVAL 10 /* ms */
> -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> -#define ACK_TIMEOUT 2
> +#define RTO_INIT 1 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
>
> @@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - if (!(conn->events & ESTABLISHED)) {
> - int exp = conn->retries - c->tcp.syn_linear_timeouts;
I didn't spot it in the previous patch, but this is (theoretically)
buggy. conn->retries is unsigned, so the subtraction will be
performed unsigned and only then cast to signed. I think that will
probably do the right thing in practice, but I don't think that's
guaranteed by the C standard (and might even be UB).
> - it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> - }
> - else
> - it.it_value.tv_sec = ACK_TIMEOUT;
> + int exp = conn->retries;
This change fixes it, by forcing the cast to a signed int before the
subtraction. It also removes the minor style error I noted in the
previous patch. Given that, I don't think we need to worry about
either of them.
> + if (!(conn->events & ESTABLISHED))
> + exp -= c->tcp.syn_linear_timeouts;
> + it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> it.it_value.tv_sec = FIN_TIMEOUT;
> } else {
> --
> 2.49.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-10-31 5:42 ` [PATCH v7 5/5] tcp: Clamp the retry timeout Yumei Huang
2025-10-31 8:38 ` Stefano Brivio
@ 2025-11-03 1:37 ` David Gibson
2025-11-03 4:06 ` Yumei Huang
2025-11-03 10:38 ` Stefano Brivio
2025-11-04 4:42 ` Stefano Brivio
2 siblings, 2 replies; 34+ messages in thread
From: David Gibson @ 2025-11-03 1:37 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 5864 bytes --]
On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
> Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> according to RFC 6298.
>
> Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> ---
> tcp.c | 25 ++++++++++++++++++++-----
> tcp.h | 2 ++
> 2 files changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 96ee56a..84a6700 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -187,6 +187,9 @@
> * for established connections, or (tcp_syn_retries +
> * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> *
> + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
> + * data retransmissions.
> + *
> * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> * the connection
> @@ -340,6 +343,7 @@ enum {
>
> #define ACK_INTERVAL 10 /* ms */
> #define RTO_INIT 1 /* s, RFC 6298 */
> +#define RTO_INIT_ACK 3 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
>
> @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
>
> #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
>
> #define TCP_SYN_RETRIES_DEFAULT 6
> #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> +#define TCP_RTO_MAX_MS_DEFAULT 120000
>
> /* "Extended" data (not stored in the flow table) for TCP flow migration */
> static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - int exp = conn->retries;
> + int exp = conn->retries, timeout = RTO_INIT;
> if (!(conn->events & ESTABLISHED))
> exp -= c->tcp.syn_linear_timeouts;
> - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> + else
> + timeout = MAX(timeout, RTO_INIT_ACK);
Possibly I missed something, since I only skimmed your discussion of
this behaviour with Stefano. But I'm not convinced this is a correct
interpretation of the RFC. (5.7) says "If the timer expires awaiting
the ACK of a SYN segment ...". That is, I think it's only suggesting
increasing the RTO to 3 at the data stage *if* we had at least one
retry during the handshake. That is, unfortunately, much fiddlier to
implement, since we need to remember what happened during the
handshake to apply it here.
Additionally, if I'm reading the RFC correctly, it's treating this as
a one-time adjustment of the RTO, which won't necessarily remain the
case for the entire data phase. Here this minimum will apply for the
entire data phase.
Even though it's a "MUST" in the RFC, I kind of think we could just
skip this for two reasons:
1) We already don't bother with RTT measurements, which the RFC
assumes the implementation is doing to adjust the RTO.
2) We expect to be talking to a guest. The chance of a high RTT is
vanishingly small compared to a path to potentially any host on
the 2011 internet.
> + timeout <<= MAX(exp, 0);
> + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> it.it_value.tv_sec = FIN_TIMEOUT;
> } else {
> @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void)
> */
> void tcp_get_rto_params(struct ctx *c)
> {
> - intmax_t tcp_syn_retries, syn_linear_timeouts;
> + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
>
> tcp_syn_retries = read_file_integer(
> TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> syn_linear_timeouts = read_file_integer(
> TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> + tcp_rto_max_ms = read_file_integer(
> + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
>
> c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> + c->tcp.tcp_rto_max = MIN(
> + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
>
> - debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> + debug("Read sysctl values tcp_syn_retries: %"PRIu8
> + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu",
> + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts,
> + c->tcp.tcp_rto_max);
> }
>
> /**
> diff --git a/tcp.h b/tcp.h
> index befedde..a238bb7 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref {
> * @fwd_out: Port forwarding configuration for outbound packets
> * @timer_run: Timestamp of most recent timer run
> * @pipe_size: Size of pipes for spliced connections
> + * @tcp_rto_max: Maximal retry timeout (in s)
> * @tcp_syn_retries: SYN retries using exponential backoff timeout
> * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> */
> @@ -67,6 +68,7 @@ struct tcp_ctx {
> struct fwd_ports fwd_out;
> struct timespec timer_run;
> size_t pipe_size;
> + size_t tcp_rto_max;
> uint8_t tcp_syn_retries;
> uint8_t syn_linear_timeouts;
> };
> --
> 2.49.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-03 1:09 ` David Gibson
@ 2025-11-03 2:31 ` Yumei Huang
2025-11-03 9:01 ` David Gibson
0 siblings, 1 reply; 34+ messages in thread
From: Yumei Huang @ 2025-11-03 2:31 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, sbrivio
On Mon, Nov 3, 2025 at 9:10 AM David Gibson <david@gibson.dropbear.id.au> wrote:
>
> On Fri, Oct 31, 2025 at 01:42:40PM +0800, Yumei Huang wrote:
> > If a client connects while guest is not connected or ready yet,
> > resend SYN instead of just resetting connection after 10 seconds.
> >
> > Use the same backoff calculation for the timeout as Linux kernel.
> >
> > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> There are a couple of tiny style nits and a mostly theoretical bug
> noted below.
>
>
> > ---
> > tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> > tcp.h | 4 ++++
> > 2 files changed, 53 insertions(+), 8 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 2ec4b0c..bada88a 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -179,9 +179,11 @@
> > *
> > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > *
> > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > - * connection
> > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
>
> Nit: I'd maybe say SYN,ACK here to distinguish it from other ACKs.
>
> > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > + * SYN. It's the starting timeout for the first SYN retry. Retry for
> > + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> > + * reset the connection
> > *
> > * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > @@ -340,7 +342,7 @@ enum {
> > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> >
> > #define ACK_INTERVAL 10 /* ms */
> > -#define SYN_TIMEOUT 10 /* s */
> > +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> > #define ACK_TIMEOUT 2
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> > @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> >
> > #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
> >
> > +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > +
> > +#define TCP_SYN_RETRIES_DEFAULT 6
> > +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> > +
> > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> >
> > @@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > - if (!(conn->events & ESTABLISHED))
> > - it.it_value.tv_sec = SYN_TIMEOUT;
> > + if (!(conn->events & ESTABLISHED)) {
> > + int exp = conn->retries - c->tcp.syn_linear_timeouts;
> > + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> > + }
> > else
>
> A non-obvious detail of the style we use is that if one branch of an
> if uses { }, then the other branch should as well, even if it's one
> line. So above should be "} else {" and the bit below changed to
> match.
Oh, I didn't know about this style. Will update it later.
>
> > it.it_value.tv_sec = ACK_TIMEOUT;
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > tcp_timer_ctl(c, conn);
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > if (!(conn->events & ESTABLISHED)) {
> > - flow_dbg(conn, "handshake timeout");
> > - tcp_rst(c, conn);
> > + if (conn->retries >= TCP_MAX_RETRIES ||
> > + conn->retries >= (c->tcp.tcp_syn_retries +
> > + c->tcp.syn_linear_timeouts)) {
>
> Here's the theoretical bug. We only clamp tcp_syn_retries and
> syn_linear_timeouts to a uint8_t, so if the host has these set to
> strangely large values, this addition could overflow.
Just checked net/ipv4/sysctl_net_ipv4.c,
static int tcp_syn_retries_min = 1;
static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
And MAX_TCP_SYNCNT is defined as 127:
#define MAX_TCP_SYNCNT 127
they have a max value as 127, so there won't be overflows.
>
> > + flow_dbg(conn, "handshake timeout");
> > + tcp_rst(c, conn);
> > + } else {
> > + flow_trace(conn, "SYN timeout, retry");
> > + tcp_send_flag(c, conn, SYN);
> > + conn->retries++;
> > + tcp_timer_ctl(c, conn);
> > + }
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > flow_dbg(conn, "FIN timeout");
> > tcp_rst(c, conn);
> > @@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void)
> > return sl;
> > }
> >
> > +/**
> > + * tcp_get_rto_params() - Get host kernel RTO parameters
> > + * @c: Execution context
> > +*/
> > +void tcp_get_rto_params(struct ctx *c)
> > +{
> > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > +
> > + tcp_syn_retries = read_file_integer(
> > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > + syn_linear_timeouts = read_file_integer(
> > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > +
> > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > +
> > + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > +}
> > +
> > /**
> > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > * @c: Execution context
> > @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c)
> > {
> > ASSERT(!c->no_tcp);
> >
> > + tcp_get_rto_params(c);
> > +
> > tcp_sock_iov_init(c);
> >
> > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > diff --git a/tcp.h b/tcp.h
> > index 234a803..befedde 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref {
> > * @fwd_out: Port forwarding configuration for outbound packets
> > * @timer_run: Timestamp of most recent timer run
> > * @pipe_size: Size of pipes for spliced connections
> > + * @tcp_syn_retries: SYN retries using exponential backoff timeout
> > + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> > */
> > struct tcp_ctx {
> > struct fwd_ports fwd_in;
> > struct fwd_ports fwd_out;
> > struct timespec timer_run;
> > size_t pipe_size;
> > + uint8_t tcp_syn_retries;
> > + uint8_t syn_linear_timeouts;
> > };
> >
> > #endif /* TCP_H */
> > --
> > 2.49.0
> >
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 4/5] tcp: Update data retransmission timeout
2025-11-03 1:18 ` David Gibson
@ 2025-11-03 2:57 ` Yumei Huang
2025-11-03 9:32 ` David Gibson
0 siblings, 1 reply; 34+ messages in thread
From: Yumei Huang @ 2025-11-03 2:57 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, sbrivio
On Mon, Nov 3, 2025 at 9:38 AM David Gibson <david@gibson.dropbear.id.au> wrote:
>
> On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
> > Use an exponential backoff timeout for data retransmission according
> > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > in Appendix A of RFC 6298.
> >
> > Also combine the macros defining the initial RTO for both SYN and ACK.
> >
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> As reported, the carried over R-b was a minor mistake, since the code
> has changed, but here's a new one:
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> Small comment below, though.
>
> > ---
> > tcp.c | 30 ++++++++++++------------------
> > 1 file changed, 12 insertions(+), 18 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index bada88a..96ee56a 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -179,16 +179,13 @@
> > *
> > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > *
> > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > - * SYN. It's the starting timeout for the first SYN retry. Retry for
> > - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> > - * reset the connection
> > - *
> > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > - * socket and reset sequence to what was acknowledged. If this persists for
> > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > + * from the socket and reset sequence to what was acknowledged. This is the
> > + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times
> > + * for established connections, or (tcp_syn_retries +
> > + * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > *
> > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > @@ -342,8 +339,7 @@ enum {
> > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> >
> > #define ACK_INTERVAL 10 /* ms */
> > -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> > -#define ACK_TIMEOUT 2
> > +#define RTO_INIT 1 /* s, RFC 6298 */
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> >
> > @@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > - if (!(conn->events & ESTABLISHED)) {
> > - int exp = conn->retries - c->tcp.syn_linear_timeouts;
>
> I didn't spot it in the previous patch, but this is (theoretically)
> buggy. conn->retries is unsigned, so the subtraction will be
> performed unsigned and only then cast to signed. I think that will
> probably do the right thing in practice, but I don't think that's
> guaranteed by the C standard (and might even be UB).
I'm not sure, but I just googled it. IIUC, the uint8_t (conn->retries
and c->tcp.syn_linear_timeouts) will go through integer promotion
before subtraction. So the line is like:
int exp = (int) conn->retries - (int) c->tcp.syn_linear_timeouts;
Please correct me if I'm wrong.
>
> > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> > - }
> > - else
> > - it.it_value.tv_sec = ACK_TIMEOUT;
> > + int exp = conn->retries;
>
> This change fixes it, by forcing the cast to a signed int before the
> subtraction. It also removes the minor style error I noted in the
> previous patch. Given that, I don't think we need to worry about
> either of them.
>
> > + if (!(conn->events & ESTABLISHED))
> > + exp -= c->tcp.syn_linear_timeouts;
> > + it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > it.it_value.tv_sec = FIN_TIMEOUT;
> > } else {
> > --
> > 2.49.0
> >
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-10-31 8:38 ` Stefano Brivio
@ 2025-11-03 3:11 ` Yumei Huang
2025-11-03 9:37 ` David Gibson
2025-11-03 10:55 ` Stefano Brivio
0 siblings, 2 replies; 34+ messages in thread
From: Yumei Huang @ 2025-11-03 3:11 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Fri, Oct 31, 2025 at 4:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Fri, 31 Oct 2025 13:42:42 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> > than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> > according to RFC 6298.
> >
> > Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > ---
> > tcp.c | 25 ++++++++++++++++++++-----
> > tcp.h | 2 ++
> > 2 files changed, 22 insertions(+), 5 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 96ee56a..84a6700 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -187,6 +187,9 @@
> > * for established connections, or (tcp_syn_retries +
> > * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > *
> > + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
>
> Nit: we started with BrE (British English) spelling and later tried to
> keep that consistent for the sake of grep, so we'll usually use
> "initialise" (for consistency, at this point).
Got it.
>
> > + * data retransmissions.
> > + *
> > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > * the connection
> > @@ -340,6 +343,7 @@ enum {
> >
> > #define ACK_INTERVAL 10 /* ms */
> > #define RTO_INIT 1 /* s, RFC 6298 */
> > +#define RTO_INIT_ACK 3 /* s, RFC 6298 */
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> >
> > @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> >
> > #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
> >
> > #define TCP_SYN_RETRIES_DEFAULT 6
> > #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> > +#define TCP_RTO_MAX_MS_DEFAULT 120000
> >
> > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > - int exp = conn->retries;
> > + int exp = conn->retries, timeout = RTO_INIT;
> > if (!(conn->events & ESTABLISHED))
> > exp -= c->tcp.syn_linear_timeouts;
> > - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> > + else
> > + timeout = MAX(timeout, RTO_INIT_ACK);
> > + timeout <<= MAX(exp, 0);
> > + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > it.it_value.tv_sec = FIN_TIMEOUT;
> > } else {
> > @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > */
> > void tcp_get_rto_params(struct ctx *c)
> > {
> > - intmax_t tcp_syn_retries, syn_linear_timeouts;
> > + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
> >
> > tcp_syn_retries = read_file_integer(
> > TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > syn_linear_timeouts = read_file_integer(
> > TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > + tcp_rto_max_ms = read_file_integer(
> > + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
> >
> > c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > + c->tcp.tcp_rto_max = MIN(
> > + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
>
> size_t looks like a rather weird choice for tcp_rto_max: size_t is used
> to represent the size of objects, and nothing else (not a timeout in
> milliseconds).
>
> It's also an int in ipv4_net_table[], net/ipv4/sysctl_net_ipv4.c
> (Linux kernel). Any reason for picking size_t here?
No particular reason. Just thought it can be used for counting as
well. I can change it to int in v8.
>
> I mean, it will work, it just looks weird (and wastes a tiny bit of space,
> even though I guess we don't care about 4 bytes there).
>
> > - debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> > - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > + debug("Read sysctl values tcp_syn_retries: %"PRIu8
> > + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu",
> > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts,
> > + c->tcp.tcp_rto_max);
> > }
> >
> > /**
> > diff --git a/tcp.h b/tcp.h
> > index befedde..a238bb7 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref {
> > * @fwd_out: Port forwarding configuration for outbound packets
> > * @timer_run: Timestamp of most recent timer run
> > * @pipe_size: Size of pipes for spliced connections
> > + * @tcp_rto_max: Maximal retry timeout (in s)
>
> Nit: "maximal" has a slightly different meaning compared to "maximum".
>
> The highest value allowed for a field would typically be called
> "maximum", while "maximal" is more commonly used to indicate a value /
> element that's the biggest of all values. Yes, I know, it's complicated.
Yeah, it's complicated. Actually I get the word from networking/ip-sysctl.rst.
tcp_rto_max_ms - INTEGER
Maximal TCP retransmission timeout (in ms).
"maximum" might be more appropriate as you explained.
>
> > * @tcp_syn_retries: SYN retries using exponential backoff timeout
> > * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> > */
> > @@ -67,6 +68,7 @@ struct tcp_ctx {
> > struct fwd_ports fwd_out;
> > struct timespec timer_run;
> > size_t pipe_size;
> > + size_t tcp_rto_max;
> > uint8_t tcp_syn_retries;
> > uint8_t syn_linear_timeouts;
> > };
>
> The rest of the series looks good to me at a *very* quick glance, but I
> can't claim I really reviewed it yet.
Thank you. Please take your time if you'd like to review them further.
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-11-03 1:37 ` David Gibson
@ 2025-11-03 4:06 ` Yumei Huang
2025-11-03 10:38 ` Stefano Brivio
1 sibling, 0 replies; 34+ messages in thread
From: Yumei Huang @ 2025-11-03 4:06 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, sbrivio
On Mon, Nov 3, 2025 at 9:38 AM David Gibson <david@gibson.dropbear.id.au> wrote:
>
> On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
> > Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> > than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> > according to RFC 6298.
> >
> > Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > ---
> > tcp.c | 25 ++++++++++++++++++++-----
> > tcp.h | 2 ++
> > 2 files changed, 22 insertions(+), 5 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 96ee56a..84a6700 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -187,6 +187,9 @@
> > * for established connections, or (tcp_syn_retries +
> > * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > *
> > + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
> > + * data retransmissions.
> > + *
> > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > * the connection
> > @@ -340,6 +343,7 @@ enum {
> >
> > #define ACK_INTERVAL 10 /* ms */
> > #define RTO_INIT 1 /* s, RFC 6298 */
> > +#define RTO_INIT_ACK 3 /* s, RFC 6298 */
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> >
> > @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> >
> > #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
> >
> > #define TCP_SYN_RETRIES_DEFAULT 6
> > #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> > +#define TCP_RTO_MAX_MS_DEFAULT 120000
> >
> > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > - int exp = conn->retries;
> > + int exp = conn->retries, timeout = RTO_INIT;
> > if (!(conn->events & ESTABLISHED))
> > exp -= c->tcp.syn_linear_timeouts;
> > - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> > + else
> > + timeout = MAX(timeout, RTO_INIT_ACK);
>
> Possibly I missed something, since I only skimmed your discussion of
> this behaviour with Stefano. But I'm not convinced this is a correct
> interpretation of the RFC. (5.7) says "If the timer expires awaiting
> the ACK of a SYN segment ...". That is, I think it's only suggesting
> increasing the RTO to 3 at the data stage *if* we had at least one
> retry during the handshake. That is, unfortunately, much fiddlier to
> implement, since we need to remember what happened during the
> handshake to apply it here.
Yes, you are right. Stefano thought it would be simpler than
re-introducing separate starting values.
>
> Additionally, if I'm reading the RFC correctly, it's treating this as
> a one-time adjustment of the RTO, which won't necessarily remain the
> case for the entire data phase. Here this minimum will apply for the
> entire data phase.
>
> Even though it's a "MUST" in the RFC, I kind of think we could just
> skip this for two reasons:
I will leave the discussion to the two of you :)
>
> 1) We already don't bother with RTT measurements, which the RFC
> assumes the implementation is doing to adjust the RTO.
>
> 2) We expect to be talking to a guest. The chance of a high RTT is
> vanishingly small compared to a path to potentially any host on
> the 2011 internet.
>
> > + timeout <<= MAX(exp, 0);
> > + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > it.it_value.tv_sec = FIN_TIMEOUT;
> > } else {
> > @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > */
> > void tcp_get_rto_params(struct ctx *c)
> > {
> > - intmax_t tcp_syn_retries, syn_linear_timeouts;
> > + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
> >
> > tcp_syn_retries = read_file_integer(
> > TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > syn_linear_timeouts = read_file_integer(
> > TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > + tcp_rto_max_ms = read_file_integer(
> > + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
> >
> > c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > + c->tcp.tcp_rto_max = MIN(
> > + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
> >
> > - debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> > - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > + debug("Read sysctl values tcp_syn_retries: %"PRIu8
> > + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu",
> > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts,
> > + c->tcp.tcp_rto_max);
> > }
> >
> > /**
> > diff --git a/tcp.h b/tcp.h
> > index befedde..a238bb7 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref {
> > * @fwd_out: Port forwarding configuration for outbound packets
> > * @timer_run: Timestamp of most recent timer run
> > * @pipe_size: Size of pipes for spliced connections
> > + * @tcp_rto_max: Maximal retry timeout (in s)
> > * @tcp_syn_retries: SYN retries using exponential backoff timeout
> > * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> > */
> > @@ -67,6 +68,7 @@ struct tcp_ctx {
> > struct fwd_ports fwd_out;
> > struct timespec timer_run;
> > size_t pipe_size;
> > + size_t tcp_rto_max;
> > uint8_t tcp_syn_retries;
> > uint8_t syn_linear_timeouts;
> > };
> > --
> > 2.49.0
> >
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-03 2:31 ` Yumei Huang
@ 2025-11-03 9:01 ` David Gibson
2025-11-04 4:42 ` Stefano Brivio
0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-11-03 9:01 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 9222 bytes --]
On Mon, Nov 03, 2025 at 10:31:21AM +0800, Yumei Huang wrote:
> On Mon, Nov 3, 2025 at 9:10 AM David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > On Fri, Oct 31, 2025 at 01:42:40PM +0800, Yumei Huang wrote:
> > > If a client connects while guest is not connected or ready yet,
> > > resend SYN instead of just resetting connection after 10 seconds.
> > >
> > > Use the same backoff calculation for the timeout as Linux kernel.
> > >
> > > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> >
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >
> > There are a couple of tiny style nits and a mostly theoretical bug
> > noted below.
> >
> >
> > > ---
> > > tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> > > tcp.h | 4 ++++
> > > 2 files changed, 53 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 2ec4b0c..bada88a 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -179,9 +179,11 @@
> > > *
> > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > *
> > > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > > - * connection
> > > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> >
> > Nit: I'd maybe say SYN,ACK here to distinguish it from other ACKs.
> >
> > > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > + * SYN. It's the starting timeout for the first SYN retry. Retry for
> > > + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> > > + * reset the connection
> > > *
> > > * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > @@ -340,7 +342,7 @@ enum {
> > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > >
> > > #define ACK_INTERVAL 10 /* ms */
> > > -#define SYN_TIMEOUT 10 /* s */
> > > +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> > > #define ACK_TIMEOUT 2
> > > #define FIN_TIMEOUT 60
> > > #define ACT_TIMEOUT 7200
> > > @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> > >
> > > #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
> > >
> > > +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > > +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > > +
> > > +#define TCP_SYN_RETRIES_DEFAULT 6
> > > +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> > > +
> > > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > >
> > > @@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > - if (!(conn->events & ESTABLISHED))
> > > - it.it_value.tv_sec = SYN_TIMEOUT;
> > > + if (!(conn->events & ESTABLISHED)) {
> > > + int exp = conn->retries - c->tcp.syn_linear_timeouts;
> > > + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> > > + }
> > > else
> >
> > A non-obvious detail of the style we use is that if one branch of an
> > if uses { }, then the other branch should as well, even if it's one
> > line. So above should be "} else {" and the bit below changed to
> > match.
>
> Oh, I didn't know about this style. Will update it later.
Since it's gone in the next patch, it doesn't really matter.
> > > it.it_value.tv_sec = ACK_TIMEOUT;
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > > tcp_timer_ctl(c, conn);
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > if (!(conn->events & ESTABLISHED)) {
> > > - flow_dbg(conn, "handshake timeout");
> > > - tcp_rst(c, conn);
> > > + if (conn->retries >= TCP_MAX_RETRIES ||
> > > + conn->retries >= (c->tcp.tcp_syn_retries +
> > > + c->tcp.syn_linear_timeouts)) {
> >
> > Here's the theoretical bug. We only clamp tcp_syn_retries and
> > syn_linear_timeouts to a uint8_t, so if the host has these set to
> > strangely large values, this addition could overflow.
>
> Just checked net/ipv4/sysctl_net_ipv4.c,
>
> static int tcp_syn_retries_min = 1;
> static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
> static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
>
> And MAX_TCP_SYNCNT is defined as 127:
>
> #define MAX_TCP_SYNCNT 127
>
> they have a max value as 127, so there won't be overflows.
Ok. That's not at all obvious from within the passt code, though.
Again, this is really only a theoretical problem, but I think ideally
we'd clamp to 127 where we read the values out of the kernel, with a
comment saying that limit is derived from the kernel's own limit.
>
> >
> > > + flow_dbg(conn, "handshake timeout");
> > > + tcp_rst(c, conn);
> > > + } else {
> > > + flow_trace(conn, "SYN timeout, retry");
> > > + tcp_send_flag(c, conn, SYN);
> > > + conn->retries++;
> > > + tcp_timer_ctl(c, conn);
> > > + }
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > flow_dbg(conn, "FIN timeout");
> > > tcp_rst(c, conn);
> > > @@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void)
> > > return sl;
> > > }
> > >
> > > +/**
> > > + * tcp_get_rto_params() - Get host kernel RTO parameters
> > > + * @c: Execution context
> > > +*/
> > > +void tcp_get_rto_params(struct ctx *c)
> > > +{
> > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > +
> > > + tcp_syn_retries = read_file_integer(
> > > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > > + syn_linear_timeouts = read_file_integer(
> > > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > +
> > > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > > +
> > > + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> > > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > > +}
> > > +
> > > /**
> > > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > > * @c: Execution context
> > > @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c)
> > > {
> > > ASSERT(!c->no_tcp);
> > >
> > > + tcp_get_rto_params(c);
> > > +
> > > tcp_sock_iov_init(c);
> > >
> > > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > > diff --git a/tcp.h b/tcp.h
> > > index 234a803..befedde 100644
> > > --- a/tcp.h
> > > +++ b/tcp.h
> > > @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref {
> > > * @fwd_out: Port forwarding configuration for outbound packets
> > > * @timer_run: Timestamp of most recent timer run
> > > * @pipe_size: Size of pipes for spliced connections
> > > + * @tcp_syn_retries: SYN retries using exponential backoff timeout
> > > + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> > > */
> > > struct tcp_ctx {
> > > struct fwd_ports fwd_in;
> > > struct fwd_ports fwd_out;
> > > struct timespec timer_run;
> > > size_t pipe_size;
> > > + uint8_t tcp_syn_retries;
> > > + uint8_t syn_linear_timeouts;
> > > };
> > >
> > > #endif /* TCP_H */
> > > --
> > > 2.49.0
> > >
> >
> > --
> > David Gibson (he or they) | I'll have my music baroque, and my code
> > david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> > | around.
> > http://www.ozlabs.org/~dgibson
>
>
>
> --
> Thanks,
>
> Yumei Huang
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 4/5] tcp: Update data retransmission timeout
2025-11-03 2:57 ` Yumei Huang
@ 2025-11-03 9:32 ` David Gibson
2025-11-04 4:42 ` Stefano Brivio
0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-11-03 9:32 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 5809 bytes --]
On Mon, Nov 03, 2025 at 10:57:27AM +0800, Yumei Huang wrote:
> On Mon, Nov 3, 2025 at 9:38 AM David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
> > > Use an exponential backoff timeout for data retransmission according
> > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > in Appendix A of RFC 6298.
> > >
> > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > >
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >
> > As reported, the carried over R-b was a minor mistake, since the code
> > has changed, but here's a new one:
> >
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >
> > Small comment below, though.
> >
> > > ---
> > > tcp.c | 30 ++++++++++++------------------
> > > 1 file changed, 12 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index bada88a..96ee56a 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -179,16 +179,13 @@
> > > *
> > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > *
> > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > - * SYN. It's the starting timeout for the first SYN retry. Retry for
> > > - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> > > - * reset the connection
> > > - *
> > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times
> > > + * for established connections, or (tcp_syn_retries +
> > > + * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > > *
> > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > @@ -342,8 +339,7 @@ enum {
> > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > >
> > > #define ACK_INTERVAL 10 /* ms */
> > > -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> > > -#define ACK_TIMEOUT 2
> > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > #define FIN_TIMEOUT 60
> > > #define ACT_TIMEOUT 7200
> > >
> > > @@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > - if (!(conn->events & ESTABLISHED)) {
> > > - int exp = conn->retries - c->tcp.syn_linear_timeouts;
> >
> > I didn't spot it in the previous patch, but this is (theoretically)
> > buggy. conn->retries is unsigned, so the subtraction will be
> > performed unsigned and only then cast to signed. I think that will
> > probably do the right thing in practice, but I don't think that's
> > guaranteed by the C standard (and might even be UB).
>
> I'm not sure, but I just googled it. IIUC, the uint8_t (conn->retries
> and c->tcp.syn_linear_timeouts) will go through integer promotion
> before subtraction. So the line is like:
>
> int exp = (int) conn->retries - (int) c->tcp.syn_linear_timeouts;
>
> Please correct me if I'm wrong.
Huh, I thought it would only be promoted if one of the operands was an
int. But C promotion rules are really confusing, so I could well be
wrong.
>
> >
> > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> > > - }
> > > - else
> > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > + int exp = conn->retries;
> >
> > This change fixes it, by forcing the cast to a signed int before the
> > subtraction. It also removes the minor style error I noted in the
> > previous patch. Given that, I don't think we need to worry about
> > either of them.
> >
> > > + if (!(conn->events & ESTABLISHED))
> > > + exp -= c->tcp.syn_linear_timeouts;
> > > + it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > } else {
> > > --
> > > 2.49.0
> > >
> >
> > --
> > David Gibson (he or they) | I'll have my music baroque, and my code
> > david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> > | around.
> > http://www.ozlabs.org/~dgibson
>
>
>
> --
> Thanks,
>
> Yumei Huang
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-11-03 3:11 ` Yumei Huang
@ 2025-11-03 9:37 ` David Gibson
2025-11-03 10:55 ` Stefano Brivio
1 sibling, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-11-03 9:37 UTC (permalink / raw)
To: Yumei Huang; +Cc: Stefano Brivio, passt-dev
[-- Attachment #1: Type: text/plain, Size: 1924 bytes --]
On Mon, Nov 03, 2025 at 11:11:10AM +0800, Yumei Huang wrote:
> On Fri, Oct 31, 2025 at 4:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Fri, 31 Oct 2025 13:42:42 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
[snip]
> > > diff --git a/tcp.h b/tcp.h
> > > index befedde..a238bb7 100644
> > > --- a/tcp.h
> > > +++ b/tcp.h
> > > @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref {
> > > * @fwd_out: Port forwarding configuration for outbound packets
> > > * @timer_run: Timestamp of most recent timer run
> > > * @pipe_size: Size of pipes for spliced connections
> > > + * @tcp_rto_max: Maximal retry timeout (in s)
> >
> > Nit: "maximal" has a slightly different meaning compared to "maximum".
> >
> > The highest value allowed for a field would typically be called
> > "maximum", while "maximal" is more commonly used to indicate a value /
> > element that's the biggest of all values. Yes, I know, it's complicated.
>
> Yeah, it's complicated. Actually I get the word from networking/ip-sysctl.rst.
>
> tcp_rto_max_ms - INTEGER
> Maximal TCP retransmission timeout (in ms).
>
> "maximum" might be more appropriate as you explained.
In mathematics, the difference is well defined. "maximal" means
nothing else is bigger than it, "maximum" means everything else is
smaller than it. Those are the same thing for a total order, but not
for a partial order.
Since integers have a total order, either would be correct here,
according to mathematician's English.
Of course, that only partly overlaps with everyday English usage. I
think either would be fine, but "maximum" is probably slightly better.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-11-03 1:37 ` David Gibson
2025-11-03 4:06 ` Yumei Huang
@ 2025-11-03 10:38 ` Stefano Brivio
2025-11-05 4:22 ` David Gibson
1 sibling, 1 reply; 34+ messages in thread
From: Stefano Brivio @ 2025-11-03 10:38 UTC (permalink / raw)
To: David Gibson; +Cc: Yumei Huang, passt-dev
On Mon, 3 Nov 2025 12:37:52 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
> > Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> > than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> > according to RFC 6298.
> >
> > Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > ---
> > tcp.c | 25 ++++++++++++++++++++-----
> > tcp.h | 2 ++
> > 2 files changed, 22 insertions(+), 5 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 96ee56a..84a6700 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -187,6 +187,9 @@
> > * for established connections, or (tcp_syn_retries +
> > * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > *
> > + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
> > + * data retransmissions.
> > + *
> > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > * the connection
> > @@ -340,6 +343,7 @@ enum {
> >
> > #define ACK_INTERVAL 10 /* ms */
> > #define RTO_INIT 1 /* s, RFC 6298 */
> > +#define RTO_INIT_ACK 3 /* s, RFC 6298 */
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> >
> > @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> >
> > #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
> >
> > #define TCP_SYN_RETRIES_DEFAULT 6
> > #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> > +#define TCP_RTO_MAX_MS_DEFAULT 120000
> >
> > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > - int exp = conn->retries;
> > + int exp = conn->retries, timeout = RTO_INIT;
> > if (!(conn->events & ESTABLISHED))
> > exp -= c->tcp.syn_linear_timeouts;
> > - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> > + else
> > + timeout = MAX(timeout, RTO_INIT_ACK);
>
> Possibly I missed something, since I only skimmed your discussion of
> this behaviour with Stefano. But I'm not convinced this is a correct
> interpretation of the RFC. (5.7) says "If the timer expires awaiting
> the ACK of a SYN segment ...". That is, I think it's only suggesting
> increasing the RTO to 3 at the data stage *if* we had at least one
> retry during the handshake.
Oops, true, my bad.
I guess I suggested the interpretation as of v7 because I just skimmed
Appendix A of RFC 6298 whose main function is to justify the reasons
behind lowering the initial timeout to one second, and I thought these
reason simply don't apply to the established phase, so we use three
seconds there.
But that's clearly not the case, hence the "When this happens" clause
in the middle of Appendix A.
> That is, unfortunately, much fiddlier to
> implement, since we need to remember what happened during the
> handshake to apply it here.
Hmm,
---
diff --git a/tcp.c b/tcp.c
index c1eb5de..90c3ca1 100644
--- a/tcp.c
+++ b/tcp.c
@@ -397,7 +397,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
static const char *tcp_flag_str[] __attribute((__unused__)) = {
"STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
- "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS",
+ "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", "SYN_RETRIED",
};
/* Listening sockets, used for automatic port forwarding in pasta mode only */
@@ -597,7 +597,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn)
int exp = conn->retries, timeout = RTO_INIT;
if (!(conn->events & ESTABLISHED))
exp -= c->tcp.syn_linear_timeouts;
- else
+ else if (conn->flags & SYN_RETRIED)
timeout = MAX(timeout, RTO_INIT_ACK);
timeout <<= MAX(exp, 0);
it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
@@ -2446,6 +2446,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
flow_trace(conn, "SYN timeout, retry");
tcp_send_flag(c, conn, SYN);
conn->retries++;
+ conn_flag(c, conn, SYN_RETRIED);
tcp_timer_ctl(c, conn);
}
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
diff --git a/tcp_conn.h b/tcp_conn.h
index c006d56..87f4a2d 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -77,6 +77,7 @@ struct tcp_tap_conn {
#define ACK_TO_TAP_DUE BIT(3)
#define ACK_FROM_TAP_DUE BIT(4)
#define ACK_FROM_TAP_BLOCKS BIT(5)
+#define SYN_RETRIED BIT(6)
#define SNDBUF_BITS 24
unsigned int sndbuf :SNDBUF_BITS;
---
?
> Additionally, if I'm reading the RFC correctly, it's treating this as
> a one-time adjustment of the RTO, which won't necessarily remain the
> case for the entire data phase. Here this minimum will apply for the
> entire data phase.
But it's the initial RTO (see Appendix A, which states it clearly), and
any exponentiation is based on the initial value, so this should fit
the requirement.
> Even though it's a "MUST" in the RFC, I kind of think we could just
> skip this for two reasons:
>
> 1) We already don't bother with RTT measurements, which the RFC
> assumes the implementation is doing to adjust the RTO.
Kind of:
(2.1) Until a round-trip time (RTT) measurement has been made for a
segment sent between the sender and receiver, the sender SHOULD
set RTO <- 1 second
and given that this condition (no round-trip time measurement done yet)
is explicitly considered, I guess we can reasonably expect TCP stacks we
might be talking to to be fully compatible with what we're doing, as
long as we stick to the RFC.
> 2) We expect to be talking to a guest. The chance of a high RTT is
> vanishingly small compared to a path to potentially any host on
> the 2011 internet.
...until we move the guest / container and we implement a TCP transport
(somewhat overdue) in place of the existing tap / UNIX domain /
vhost-user connection.
The chance is still small and the consequences of ignoring this part of
the RFC are, I'm fairly sure, negligible, but if it's that easy to
implement, should we really depart from it?
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-11-03 3:11 ` Yumei Huang
2025-11-03 9:37 ` David Gibson
@ 2025-11-03 10:55 ` Stefano Brivio
1 sibling, 0 replies; 34+ messages in thread
From: Stefano Brivio @ 2025-11-03 10:55 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Mon, 3 Nov 2025 11:11:10 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Fri, Oct 31, 2025 at 4:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Fri, 31 Oct 2025 13:42:42 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> > > than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> > > according to RFC 6298.
> > >
> > > Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > ---
> > > tcp.c | 25 ++++++++++++++++++++-----
> > > tcp.h | 2 ++
> > > 2 files changed, 22 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 96ee56a..84a6700 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -187,6 +187,9 @@
> > > * for established connections, or (tcp_syn_retries +
> > > * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > > *
> > > + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
> >
> > Nit: we started with BrE (British English) spelling and later tried to
> > keep that consistent for the sake of grep, so we'll usually use
> > "initialise" (for consistency, at this point).
>
> Got it.
> >
> > > + * data retransmissions.
> > > + *
> > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > * the connection
> > > @@ -340,6 +343,7 @@ enum {
> > >
> > > #define ACK_INTERVAL 10 /* ms */
> > > #define RTO_INIT 1 /* s, RFC 6298 */
> > > +#define RTO_INIT_ACK 3 /* s, RFC 6298 */
> > > #define FIN_TIMEOUT 60
> > > #define ACT_TIMEOUT 7200
> > >
> > > @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> > >
> > > #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > > #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > > +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
> > >
> > > #define TCP_SYN_RETRIES_DEFAULT 6
> > > #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> > > +#define TCP_RTO_MAX_MS_DEFAULT 120000
> > >
> > > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > > @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > - int exp = conn->retries;
> > > + int exp = conn->retries, timeout = RTO_INIT;
> > > if (!(conn->events & ESTABLISHED))
> > > exp -= c->tcp.syn_linear_timeouts;
> > > - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> > > + else
> > > + timeout = MAX(timeout, RTO_INIT_ACK);
> > > + timeout <<= MAX(exp, 0);
> > > + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > } else {
> > > @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > > */
> > > void tcp_get_rto_params(struct ctx *c)
> > > {
> > > - intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
> > >
> > > tcp_syn_retries = read_file_integer(
> > > TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > > syn_linear_timeouts = read_file_integer(
> > > TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > + tcp_rto_max_ms = read_file_integer(
> > > + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
> > >
> > > c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > > c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > > + c->tcp.tcp_rto_max = MIN(
> > > + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
> >
> > size_t looks like a rather weird choice for tcp_rto_max: size_t is used
> > to represent the size of objects, and nothing else (not a timeout in
> > milliseconds).
> >
> > It's also an int in ipv4_net_table[], net/ipv4/sysctl_net_ipv4.c
> > (Linux kernel). Any reason for picking size_t here?
>
> No particular reason. Just thought it can be used for counting as
> well. I can change it to int in v8.
> >
> > I mean, it will work, it just looks weird (and wastes a tiny bit of space,
> > even though I guess we don't care about 4 bytes there).
> >
> > > - debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> > > - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > > + debug("Read sysctl values tcp_syn_retries: %"PRIu8
> > > + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu",
> > > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts,
> > > + c->tcp.tcp_rto_max);
> > > }
> > >
> > > /**
> > > diff --git a/tcp.h b/tcp.h
> > > index befedde..a238bb7 100644
> > > --- a/tcp.h
> > > +++ b/tcp.h
> > > @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref {
> > > * @fwd_out: Port forwarding configuration for outbound packets
> > > * @timer_run: Timestamp of most recent timer run
> > > * @pipe_size: Size of pipes for spliced connections
> > > + * @tcp_rto_max: Maximal retry timeout (in s)
> >
> > Nit: "maximal" has a slightly different meaning compared to "maximum".
> >
> > The highest value allowed for a field would typically be called
> > "maximum", while "maximal" is more commonly used to indicate a value /
> > element that's the biggest of all values. Yes, I know, it's complicated.
>
> Yeah, it's complicated. Actually I get the word from networking/ip-sysctl.rst.
>
> tcp_rto_max_ms - INTEGER
> Maximal TCP retransmission timeout (in ms).
Oh, that's from 1280c26228bd ("tcp: add tcp_rto_max_ms sysctl), so
probably from the French adjective "maximal" (in French, "maximum" is
only a noun).
> "maximum" might be more appropriate as you explained.
>
> >
> > > * @tcp_syn_retries: SYN retries using exponential backoff timeout
> > > * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> > > */
> > > @@ -67,6 +68,7 @@ struct tcp_ctx {
> > > struct fwd_ports fwd_out;
> > > struct timespec timer_run;
> > > size_t pipe_size;
> > > + size_t tcp_rto_max;
> > > uint8_t tcp_syn_retries;
> > > uint8_t syn_linear_timeouts;
> > > };
> >
> > The rest of the series looks good to me at a *very* quick glance, but I
> > can't claim I really reviewed it yet.
>
> Thank you. Please take your time if you'd like to review them further.
Yes, I already started, didn't quite finish yet.
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 4/5] tcp: Update data retransmission timeout
2025-11-03 9:32 ` David Gibson
@ 2025-11-04 4:42 ` Stefano Brivio
2025-11-05 4:19 ` David Gibson
0 siblings, 1 reply; 34+ messages in thread
From: Stefano Brivio @ 2025-11-04 4:42 UTC (permalink / raw)
To: David Gibson; +Cc: Yumei Huang, passt-dev
On Mon, 3 Nov 2025 20:32:14 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Nov 03, 2025 at 10:57:27AM +0800, Yumei Huang wrote:
> > On Mon, Nov 3, 2025 at 9:38 AM David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
> > > > Use an exponential backoff timeout for data retransmission according
> > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > in Appendix A of RFC 6298.
> > > >
> > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > >
> > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > >
> > > As reported, the carried over R-b was a minor mistake, since the code
> > > has changed, but here's a new one:
> > >
> > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > >
> > > Small comment below, though.
> > >
> > > > ---
> > > > tcp.c | 30 ++++++++++++------------------
> > > > 1 file changed, 12 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/tcp.c b/tcp.c
> > > > index bada88a..96ee56a 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -179,16 +179,13 @@
> > > > *
> > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > *
> > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > - * SYN. It's the starting timeout for the first SYN retry. Retry for
> > > > - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> > > > - * reset the connection
> > > > - *
> > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times
> > > > + * for established connections, or (tcp_syn_retries +
> > > > + * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > > > *
> > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > @@ -342,8 +339,7 @@ enum {
> > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > >
> > > > #define ACK_INTERVAL 10 /* ms */
> > > > -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> > > > -#define ACK_TIMEOUT 2
> > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > #define FIN_TIMEOUT 60
> > > > #define ACT_TIMEOUT 7200
> > > >
> > > > @@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > - if (!(conn->events & ESTABLISHED)) {
> > > > - int exp = conn->retries - c->tcp.syn_linear_timeouts;
> > >
> > > I didn't spot it in the previous patch, but this is (theoretically)
> > > buggy. conn->retries is unsigned, so the subtraction will be
> > > performed unsigned and only then cast to signed. I think that will
> > > probably do the right thing in practice, but I don't think that's
> > > guaranteed by the C standard (and might even be UB).
> >
> > I'm not sure, but I just googled it. IIUC, the uint8_t (conn->retries
> > and c->tcp.syn_linear_timeouts) will go through integer promotion
> > before subtraction. So the line is like:
> >
> > int exp = (int) conn->retries - (int) c->tcp.syn_linear_timeouts;
> >
> > Please correct me if I'm wrong.
>
> Huh, I thought it would only be promoted if one of the operands was an
> int.
I thought and I think so too, yes.
> But C promotion rules are really confusing, so I could well be wrong.
Some references linked at:
https://archives.passt.top/passt-dev/20240103080834.24fa0a7a@elisabeth/
and here, it's about paragraph 6.3.1.8, "Usual arithmetic conversions"
of C11 (that's what passt uses right now).
No double or float here, so this is the relevant text:
Otherwise, the integer promotions are performed on both operands.
Then the following rules are applied to the promoted operands:
If both operands have the same type, then no further conversion
is needed
now, are they really the same type? The standard doesn't define uint8_t.
If we see it as an unsigned int, with storage limited to 8 bits, then I
would call them the same type.
If we see c->tcp.syn_linear_timeouts as a character type instead
(6.3.1.1, "Boolean, characters, and integers"), then the integer
conversion rank of unsigned int (conn->retries) is greater than the rank
of char, and the same text applies (promotion of char to unsigned int).
But I'm fairly sure that's not the case. We used uint8_t, not short,
and not char.
All in all I don't think there's any possibility of promotion to a
signed int: for that, you would need one of the operands to be signed.
Or two unsigned chars/shorts (see examples below).
If the operation is performed unsigned, and that's what gcc appears to
do here, promoting both operands to unsigned int (32-bit):
---
$ cat pro_uh_oh_motion.c
#include <stdio.h>
#include <stdint.h>
int main()
{
uint8_t a = 0;
struct { unsigned b:3; } x = { 7 };
printf("%u %i\n", a - x.b, a - x.b);
}
$ gcc -o pro_uh_oh_motion{,.c}
$ ./pro_uh_oh_motion
4294967289 -7
---
...do we really have a problem with that? To me it looks like the
behaviour is defined, and it's what we want.
These pages:
https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+conversion+rules
https://wiki.sei.cmu.edu/confluence/display/c/EXP14-C.+Beware+of+integer+promotion+when+performing+bitwise+operations+on+integer+types+smaller+than+int
show some problematic examples. This one is close to our case:
unsigned short x = 45000, y = 50000;
unsigned int z = x * y;
but there we have implicit promotion of the unsigned shorts to
int, which doesn't happen in our case, because we already have an
unsigned int.
See also the examples with two unsigned shorts from:
https://cryptoservices.github.io/fde/2018/11/30/undefined-behavior.html
and there the operation would be performed as signed int.
On top of that, you have the fact that the original types can't store
the result of the operation. Here it's not the case.
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-10-31 5:42 ` [PATCH v7 3/5] tcp: Resend SYN for inbound connections Yumei Huang
2025-11-03 1:09 ` David Gibson
@ 2025-11-04 4:42 ` Stefano Brivio
2025-11-05 4:24 ` David Gibson
1 sibling, 1 reply; 34+ messages in thread
From: Stefano Brivio @ 2025-11-04 4:42 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Fri, 31 Oct 2025 13:42:40 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> If a client connects while guest is not connected or ready yet,
> resend SYN instead of just resetting connection after 10 seconds.
>
> Use the same backoff calculation for the timeout as Linux kernel.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=153
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> ---
> tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> tcp.h | 4 ++++
> 2 files changed, 53 insertions(+), 8 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 2ec4b0c..bada88a 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -179,9 +179,11 @@
> *
> * Timeouts are implemented by means of timerfd timers, set based on flags:
> *
> - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> - * connection
> + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> + * SYN. It's the starting timeout for the first SYN retry. Retry for
> + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> + * reset the connection
> *
> * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> @@ -340,7 +342,7 @@ enum {
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> #define ACK_INTERVAL 10 /* ms */
> -#define SYN_TIMEOUT 10 /* s */
> +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> #define ACK_TIMEOUT 2
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
> @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
>
> #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
>
> +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> +
> +#define TCP_SYN_RETRIES_DEFAULT 6
> +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> +
> /* "Extended" data (not stored in the flow table) for TCP flow migration */
> static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
>
> @@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - if (!(conn->events & ESTABLISHED))
> - it.it_value.tv_sec = SYN_TIMEOUT;
> + if (!(conn->events & ESTABLISHED)) {
> + int exp = conn->retries - c->tcp.syn_linear_timeouts;
> + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> + }
> else
> it.it_value.tv_sec = ACK_TIMEOUT;
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> tcp_timer_ctl(c, conn);
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> if (!(conn->events & ESTABLISHED)) {
> - flow_dbg(conn, "handshake timeout");
> - tcp_rst(c, conn);
> + if (conn->retries >= TCP_MAX_RETRIES ||
> + conn->retries >= (c->tcp.tcp_syn_retries +
> + c->tcp.syn_linear_timeouts)) {
This:
int max;
max = c->tcp.syn_retries + c->tcp.syn_linear_timeouts;
max = MIN(TCP_MAX_RETRIES, max);
if (conn->retries >= max) {
is one more line but a bit easier to understand. Not a strong
preference on my side though.
> + flow_dbg(conn, "handshake timeout");
> + tcp_rst(c, conn);
> + } else {
> + flow_trace(conn, "SYN timeout, retry");
> + tcp_send_flag(c, conn, SYN);
> + conn->retries++;
> + tcp_timer_ctl(c, conn);
> + }
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> flow_dbg(conn, "FIN timeout");
> tcp_rst(c, conn);
> @@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void)
> return sl;
> }
>
> +/**
> + * tcp_get_rto_params() - Get host kernel RTO parameters
> + * @c: Execution context
> +*/
> +void tcp_get_rto_params(struct ctx *c)
> +{
> + intmax_t tcp_syn_retries, syn_linear_timeouts;
> +
> + tcp_syn_retries = read_file_integer(
> + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> + syn_linear_timeouts = read_file_integer(
> + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
I think this is a bit hard to read. Now:
tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES,
TCP_SYN_RETRIES_DEFAULT);
syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS,
TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
would be a bit closer to the coding style we're adopting, but I wonder:
- does read_file_integer() really need to have "integer" in the name?
In a language where integers are called 'int', perhaps
read_file_int() is clear enough?
- the constants are defined here, in tcp.c, so they obviously refer to
TCP, so maybe SYN_LINEAR_TIMEOUTS is clear enough
- you don't really need to store both values in two different
variables, one is enough as you're assigning them right away. And:
v = read_file_int(SYN_RETRIES, SYN_RETRIES_DEFAULT);
c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
v = read_file_int(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
is four lines instead of six, and more readable if you ask me.
> +
> + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> +
> + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> +}
> +
> /**
> * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> * @c: Execution context
> @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c)
> {
> ASSERT(!c->no_tcp);
>
> + tcp_get_rto_params(c);
> +
> tcp_sock_iov_init(c);
>
> memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> diff --git a/tcp.h b/tcp.h
> index 234a803..befedde 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref {
> * @fwd_out: Port forwarding configuration for outbound packets
> * @timer_run: Timestamp of most recent timer run
> * @pipe_size: Size of pipes for spliced connections
> + * @tcp_syn_retries: SYN retries using exponential backoff timeout
> + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> */
> struct tcp_ctx {
> struct fwd_ports fwd_in;
> struct fwd_ports fwd_out;
> struct timespec timer_run;
> size_t pipe_size;
> + uint8_t tcp_syn_retries;
Why 'tcp' again?
> + uint8_t syn_linear_timeouts;
> };
>
> #endif /* TCP_H */
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-03 9:01 ` David Gibson
@ 2025-11-04 4:42 ` Stefano Brivio
0 siblings, 0 replies; 34+ messages in thread
From: Stefano Brivio @ 2025-11-04 4:42 UTC (permalink / raw)
To: David Gibson; +Cc: Yumei Huang, passt-dev
On Mon, 3 Nov 2025 20:01:46 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Nov 03, 2025 at 10:31:21AM +0800, Yumei Huang wrote:
> > On Mon, Nov 3, 2025 at 9:10 AM David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > On Fri, Oct 31, 2025 at 01:42:40PM +0800, Yumei Huang wrote:
> > > > If a client connects while guest is not connected or ready yet,
> > > > resend SYN instead of just resetting connection after 10 seconds.
> > > >
> > > > Use the same backoff calculation for the timeout as Linux kernel.
> > > >
> > > > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > >
> > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > >
> > > There are a couple of tiny style nits and a mostly theoretical bug
> > > noted below.
> > >
> > >
> > > > ---
> > > > tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> > > > tcp.h | 4 ++++
> > > > 2 files changed, 53 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 2ec4b0c..bada88a 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -179,9 +179,11 @@
> > > > *
> > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > *
> > > > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > > > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > > > - * connection
> > > > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > >
> > > Nit: I'd maybe say SYN,ACK here to distinguish it from other ACKs.
> > >
> > > > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > + * SYN. It's the starting timeout for the first SYN retry. Retry for
> > > > + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times,
> > > > + * reset the connection
> > > > *
> > > > * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > @@ -340,7 +342,7 @@ enum {
> > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > >
> > > > #define ACK_INTERVAL 10 /* ms */
> > > > -#define SYN_TIMEOUT 10 /* s */
> > > > +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */
> > > > #define ACK_TIMEOUT 2
> > > > #define FIN_TIMEOUT 60
> > > > #define ACT_TIMEOUT 7200
> > > > @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> > > >
> > > > #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
> > > >
> > > > +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > > > +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > > > +
> > > > +#define TCP_SYN_RETRIES_DEFAULT 6
> > > > +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> > > > +
> > > > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > > > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > > >
> > > > @@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > - if (!(conn->events & ESTABLISHED))
> > > > - it.it_value.tv_sec = SYN_TIMEOUT;
> > > > + if (!(conn->events & ESTABLISHED)) {
> > > > + int exp = conn->retries - c->tcp.syn_linear_timeouts;
> > > > + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0);
> > > > + }
> > > > else
> > >
> > > A non-obvious detail of the style we use is that if one branch of an
> > > if uses { }, then the other branch should as well, even if it's one
> > > line. So above should be "} else {" and the bit below changed to
> > > match.
> >
> > Oh, I didn't know about this style. Will update it later.
>
> Since it's gone in the next patch, it doesn't really matter.
>
> > > > it.it_value.tv_sec = ACK_TIMEOUT;
> > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > > > tcp_timer_ctl(c, conn);
> > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > if (!(conn->events & ESTABLISHED)) {
> > > > - flow_dbg(conn, "handshake timeout");
> > > > - tcp_rst(c, conn);
> > > > + if (conn->retries >= TCP_MAX_RETRIES ||
> > > > + conn->retries >= (c->tcp.tcp_syn_retries +
> > > > + c->tcp.syn_linear_timeouts)) {
> > >
> > > Here's the theoretical bug. We only clamp tcp_syn_retries and
> > > syn_linear_timeouts to a uint8_t, so if the host has these set to
> > > strangely large values, this addition could overflow.
Actually, I think it won't, because it's performed as 'unsigned int',
see my other email on the subject. But in any case:
> > Just checked net/ipv4/sysctl_net_ipv4.c,
> >
> > static int tcp_syn_retries_min = 1;
> > static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
> > static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
> >
> > And MAX_TCP_SYNCNT is defined as 127:
> >
> > #define MAX_TCP_SYNCNT 127
...these limits come and go, so...
> > they have a max value as 127, so there won't be overflows.
>
> Ok. That's not at all obvious from within the passt code, though.
> Again, this is really only a theoretical problem, but I think ideally
> we'd clamp to 127 where we read the values out of the kernel, with a
> comment saying that limit is derived from the kernel's own limit.
that would sound like a good idea anyway, instead of clamping to 255 as
done later in this patch.
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-10-31 5:42 ` [PATCH v7 5/5] tcp: Clamp the retry timeout Yumei Huang
2025-10-31 8:38 ` Stefano Brivio
2025-11-03 1:37 ` David Gibson
@ 2025-11-04 4:42 ` Stefano Brivio
2 siblings, 0 replies; 34+ messages in thread
From: Stefano Brivio @ 2025-11-04 4:42 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Fri, 31 Oct 2025 13:42:42 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> according to RFC 6298.
>
> Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> ---
> tcp.c | 25 ++++++++++++++++++++-----
> tcp.h | 2 ++
> 2 files changed, 22 insertions(+), 5 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 96ee56a..84a6700 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -187,6 +187,9 @@
> * for established connections, or (tcp_syn_retries +
> * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> *
> + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
> + * data retransmissions.
> + *
> * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> * the connection
> @@ -340,6 +343,7 @@ enum {
>
> #define ACK_INTERVAL 10 /* ms */
> #define RTO_INIT 1 /* s, RFC 6298 */
> +#define RTO_INIT_ACK 3 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
>
> @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
>
> #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
Same as my previous comment on 3/5: you could skip "TCP_" in this name,
and:
>
> #define TCP_SYN_RETRIES_DEFAULT 6
> #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> +#define TCP_RTO_MAX_MS_DEFAULT 120000
here too, and:
> /* "Extended" data (not stored in the flow table) for TCP flow migration */
> static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - int exp = conn->retries;
> + int exp = conn->retries, timeout = RTO_INIT;
> if (!(conn->events & ESTABLISHED))
> exp -= c->tcp.syn_linear_timeouts;
> - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> + else
> + timeout = MAX(timeout, RTO_INIT_ACK);
> + timeout <<= MAX(exp, 0);
> + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> it.it_value.tv_sec = FIN_TIMEOUT;
> } else {
> @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void)
> */
> void tcp_get_rto_params(struct ctx *c)
> {
> - intmax_t tcp_syn_retries, syn_linear_timeouts;
> + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
>
> tcp_syn_retries = read_file_integer(
> TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> syn_linear_timeouts = read_file_integer(
> TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> + tcp_rto_max_ms = read_file_integer(
> + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
>
> c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> + c->tcp.tcp_rto_max = MIN(
> + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
>
> - debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> + debug("Read sysctl values tcp_syn_retries: %"PRIu8
> + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu",
> + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts,
> + c->tcp.tcp_rto_max);
> }
>
> /**
> diff --git a/tcp.h b/tcp.h
> index befedde..a238bb7 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref {
> * @fwd_out: Port forwarding configuration for outbound packets
> * @timer_run: Timestamp of most recent timer run
> * @pipe_size: Size of pipes for spliced connections
> + * @tcp_rto_max: Maximal retry timeout (in s)
> * @tcp_syn_retries: SYN retries using exponential backoff timeout
> * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> */
> @@ -67,6 +68,7 @@ struct tcp_ctx {
> struct fwd_ports fwd_out;
> struct timespec timer_run;
> size_t pipe_size;
> + size_t tcp_rto_max;
here too.
> uint8_t tcp_syn_retries;
> uint8_t syn_linear_timeouts;
> };
I finally finished reviewing, minus pending comments the series looks
good to me.
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 4/5] tcp: Update data retransmission timeout
2025-11-04 4:42 ` Stefano Brivio
@ 2025-11-05 4:19 ` David Gibson
0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-11-05 4:19 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yumei Huang, passt-dev
[-- Attachment #1: Type: text/plain, Size: 5745 bytes --]
On Tue, Nov 04, 2025 at 05:42:25AM +0100, Stefano Brivio wrote:
> On Mon, 3 Nov 2025 20:32:14 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Mon, Nov 03, 2025 at 10:57:27AM +0800, Yumei Huang wrote:
> > > On Mon, Nov 3, 2025 at 9:38 AM David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
[snip]
> > > > > @@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > - if (!(conn->events & ESTABLISHED)) {
> > > > > - int exp = conn->retries - c->tcp.syn_linear_timeouts;
> > > >
> > > > I didn't spot it in the previous patch, but this is (theoretically)
> > > > buggy. conn->retries is unsigned, so the subtraction will be
> > > > performed unsigned and only then cast to signed. I think that will
> > > > probably do the right thing in practice, but I don't think that's
> > > > guaranteed by the C standard (and might even be UB).
> > >
> > > I'm not sure, but I just googled it. IIUC, the uint8_t (conn->retries
> > > and c->tcp.syn_linear_timeouts) will go through integer promotion
> > > before subtraction. So the line is like:
> > >
> > > int exp = (int) conn->retries - (int) c->tcp.syn_linear_timeouts;
> > >
> > > Please correct me if I'm wrong.
> >
> > Huh, I thought it would only be promoted if one of the operands was an
> > int.
>
> I thought and I think so too, yes.
>
> > But C promotion rules are really confusing, so I could well be wrong.
>
> Some references linked at:
>
> https://archives.passt.top/passt-dev/20240103080834.24fa0a7a@elisabeth/
>
> and here, it's about paragraph 6.3.1.8, "Usual arithmetic conversions"
> of C11 (that's what passt uses right now).
>
> No double or float here, so this is the relevant text:
>
> Otherwise, the integer promotions are performed on both operands.
> Then the following rules are applied to the promoted operands:
>
> If both operands have the same type, then no further conversion
> is needed
>
> now, are they really the same type? The standard doesn't define uint8_t.
> If we see it as an unsigned int, with storage limited to 8 bits, then I
> would call them the same type.
So, I was a bit confused in the first place - I had thought both
operands were literally uint8_t typed. But that's not the case,
'retries' is an unsigned int that's bit limited.
> If we see c->tcp.syn_linear_timeouts as a character type instead
> (6.3.1.1, "Boolean, characters, and integers"), then the integer
> conversion rank of unsigned int (conn->retries) is greater than the rank
> of char, and the same text applies (promotion of char to unsigned int).
>
> But I'm fairly sure that's not the case. We used uint8_t, not short,
> and not char.
>
> All in all I don't think there's any possibility of promotion to a
> signed int: for that, you would need one of the operands to be signed.
> Or two unsigned chars/shorts (see examples below).
Right. I'm fairly confident the arguments will be promoted to
unsigned int, not signed.
> If the operation is performed unsigned, and that's what gcc appears to
> do here, promoting both operands to unsigned int (32-bit):
>
> ---
> $ cat pro_uh_oh_motion.c
> #include <stdio.h>
> #include <stdint.h>
>
> int main()
> {
> uint8_t a = 0;
> struct { unsigned b:3; } x = { 7 };
>
> printf("%u %i\n", a - x.b, a - x.b);
> }
> $ gcc -o pro_uh_oh_motion{,.c}
> $ ./pro_uh_oh_motion
> 4294967289 -7
> ---
>
> ...do we really have a problem with that? To me it looks like the
> behaviour is defined, and it's what we want.
So, this is a second step, where we cast the result of the subtraction
into a signed int of the same size. I'm not sure if that behaviour is
defined for values outside the positive range the signed type can
represent. In practice the obvious result on a two's complement
machine will do what we need, but I'm not sure that C guarantees that.
>
> These pages:
>
> https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+conversion+rules
> https://wiki.sei.cmu.edu/confluence/display/c/EXP14-C.+Beware+of+integer+promotion+when+performing+bitwise+operations+on+integer+types+smaller+than+int
>
> show some problematic examples. This one is close to our case:
>
> unsigned short x = 45000, y = 50000;
> unsigned int z = x * y;
>
> but there we have implicit promotion of the unsigned shorts to
> int, which doesn't happen in our case, because we already have an
> unsigned int.
Also multiplication which is a whole other thing.
> See also the examples with two unsigned shorts from:
>
> https://cryptoservices.github.io/fde/2018/11/30/undefined-behavior.html
>
> and there the operation would be performed as signed int.
>
> On top of that, you have the fact that the original types can't store
> the result of the operation. Here it's not the case.
They can't though - the result is negative which can't be stored in an
unsigned type. Now, wraparound / mod 2^n behaviour *is* defined for
unsigned (but not signed) integer types. The second part where we
assign an unsigned "negative" value into a signed variable might not
be.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 5/5] tcp: Clamp the retry timeout
2025-11-03 10:38 ` Stefano Brivio
@ 2025-11-05 4:22 ` David Gibson
0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-11-05 4:22 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yumei Huang, passt-dev
[-- Attachment #1: Type: text/plain, Size: 7520 bytes --]
On Mon, Nov 03, 2025 at 11:38:57AM +0100, Stefano Brivio wrote:
> On Mon, 3 Nov 2025 12:37:52 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
> > > Clamp the TCP retry timeout as Linux kernel does. If RTO is less
> > > than 3 seconds, re-initialize it to 3 seconds for data retransmissions
> > > according to RFC 6298.
> > >
> > > Suggested-by: Stefano Brivio <sbrivio@redhat.com>
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > ---
> > > tcp.c | 25 ++++++++++++++++++++-----
> > > tcp.h | 2 ++
> > > 2 files changed, 22 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 96ee56a..84a6700 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -187,6 +187,9 @@
> > > * for established connections, or (tcp_syn_retries +
> > > * tcp_syn_linear_timeouts) times during the handshake, reset the connection
> > > *
> > > + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
> > > + * data retransmissions.
> > > + *
> > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > * the connection
> > > @@ -340,6 +343,7 @@ enum {
> > >
> > > #define ACK_INTERVAL 10 /* ms */
> > > #define RTO_INIT 1 /* s, RFC 6298 */
> > > +#define RTO_INIT_ACK 3 /* s, RFC 6298 */
> > > #define FIN_TIMEOUT 60
> > > #define ACT_TIMEOUT 7200
> > >
> > > @@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> > >
> > > #define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > > #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts"
> > > +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
> > >
> > > #define TCP_SYN_RETRIES_DEFAULT 6
> > > #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4
> > > +#define TCP_RTO_MAX_MS_DEFAULT 120000
> > >
> > > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > > @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > - int exp = conn->retries;
> > > + int exp = conn->retries, timeout = RTO_INIT;
> > > if (!(conn->events & ESTABLISHED))
> > > exp -= c->tcp.syn_linear_timeouts;
> > > - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> > > + else
> > > + timeout = MAX(timeout, RTO_INIT_ACK);
> >
> > Possibly I missed something, since I only skimmed your discussion of
> > this behaviour with Stefano. But I'm not convinced this is a correct
> > interpretation of the RFC. (5.7) says "If the timer expires awaiting
> > the ACK of a SYN segment ...". That is, I think it's only suggesting
> > increasing the RTO to 3 at the data stage *if* we had at least one
> > retry during the handshake.
>
> Oops, true, my bad.
>
> I guess I suggested the interpretation as of v7 because I just skimmed
> Appendix A of RFC 6298 whose main function is to justify the reasons
> behind lowering the initial timeout to one second, and I thought these
> reason simply don't apply to the established phase, so we use three
> seconds there.
>
> But that's clearly not the case, hence the "When this happens" clause
> in the middle of Appendix A.
>
> > That is, unfortunately, much fiddlier to
> > implement, since we need to remember what happened during the
> > handshake to apply it here.
>
> Hmm,
>
> ---
> diff --git a/tcp.c b/tcp.c
> index c1eb5de..90c3ca1 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -397,7 +397,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
>
> static const char *tcp_flag_str[] __attribute((__unused__)) = {
> "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
> - "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS",
> + "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", "SYN_RETRIED",
> };
>
> /* Listening sockets, used for automatic port forwarding in pasta mode only */
> @@ -597,7 +597,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn)
> int exp = conn->retries, timeout = RTO_INIT;
> if (!(conn->events & ESTABLISHED))
> exp -= c->tcp.syn_linear_timeouts;
> - else
> + else if (conn->flags & SYN_RETRIED)
> timeout = MAX(timeout, RTO_INIT_ACK);
> timeout <<= MAX(exp, 0);
> it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max);
> @@ -2446,6 +2446,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> flow_trace(conn, "SYN timeout, retry");
> tcp_send_flag(c, conn, SYN);
> conn->retries++;
> + conn_flag(c, conn, SYN_RETRIED);
> tcp_timer_ctl(c, conn);
> }
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> diff --git a/tcp_conn.h b/tcp_conn.h
> index c006d56..87f4a2d 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -77,6 +77,7 @@ struct tcp_tap_conn {
> #define ACK_TO_TAP_DUE BIT(3)
> #define ACK_FROM_TAP_DUE BIT(4)
> #define ACK_FROM_TAP_BLOCKS BIT(5)
> +#define SYN_RETRIED BIT(6)
>
> #define SNDBUF_BITS 24
> unsigned int sndbuf :SNDBUF_BITS;
> ---
>
> ?
Ok, not as fiddly as I feared. \o/
> > Additionally, if I'm reading the RFC correctly, it's treating this as
> > a one-time adjustment of the RTO, which won't necessarily remain the
> > case for the entire data phase. Here this minimum will apply for the
> > entire data phase.
>
> But it's the initial RTO (see Appendix A, which states it clearly), and
> any exponentiation is based on the initial value, so this should fit
> the requirement.
>
> > Even though it's a "MUST" in the RFC, I kind of think we could just
> > skip this for two reasons:
> >
> > 1) We already don't bother with RTT measurements, which the RFC
> > assumes the implementation is doing to adjust the RTO.
>
> Kind of:
>
> (2.1) Until a round-trip time (RTT) measurement has been made for a
> segment sent between the sender and receiver, the sender SHOULD
> set RTO <- 1 second
>
> and given that this condition (no round-trip time measurement done yet)
> is explicitly considered, I guess we can reasonably expect TCP stacks we
> might be talking to to be fully compatible with what we're doing, as
> long as we stick to the RFC.
Good points, you convinced me.
> > 2) We expect to be talking to a guest. The chance of a high RTT is
> > vanishingly small compared to a path to potentially any host on
> > the 2011 internet.
>
> ...until we move the guest / container and we implement a TCP transport
> (somewhat overdue) in place of the existing tap / UNIX domain /
> vhost-user connection.
Right. Or even right now, if the guest then NATs the connection onto
a slow VPN link.
> The chance is still small and the consequences of ignoring this part of
> the RFC are, I'm fairly sure, negligible, but if it's that easy to
> implement, should we really depart from it?
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-04 4:42 ` Stefano Brivio
@ 2025-11-05 4:24 ` David Gibson
2025-11-05 7:00 ` Stefano Brivio
0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2025-11-05 4:24 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yumei Huang, passt-dev
[-- Attachment #1: Type: text/plain, Size: 3620 bytes --]
On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
> On Fri, 31 Oct 2025 13:42:40 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
[snip]
> > +/**
> > + * tcp_get_rto_params() - Get host kernel RTO parameters
> > + * @c: Execution context
> > +*/
> > +void tcp_get_rto_params(struct ctx *c)
> > +{
> > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > +
> > + tcp_syn_retries = read_file_integer(
> > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > + syn_linear_timeouts = read_file_integer(
> > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
>
> I think this is a bit hard to read. Now:
>
> tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES,
> TCP_SYN_RETRIES_DEFAULT);
> syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS,
> TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
>
> would be a bit closer to the coding style we're adopting, but I wonder:
>
> - does read_file_integer() really need to have "integer" in the name?
> In a language where integers are called 'int', perhaps
> read_file_int() is clear enough?
I think the idea is that read_file_integer() can be used for any
(signed) integer type (with range checking performed after the call).
read_file_int() might suggest it reads exactly an 'int', not anything
bigger or smaller.
> - the constants are defined here, in tcp.c, so they obviously refer to
> TCP, so maybe SYN_LINEAR_TIMEOUTS is clear enough
>
> - you don't really need to store both values in two different
> variables, one is enough as you're assigning them right away. And:
>
> v = read_file_int(SYN_RETRIES, SYN_RETRIES_DEFAULT);
> c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
>
> v = read_file_int(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
> c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
>
> is four lines instead of six, and more readable if you ask me.
>
> > +
> > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > +
> > + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > +}
> > +
> > /**
> > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > * @c: Execution context
> > @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c)
> > {
> > ASSERT(!c->no_tcp);
> >
> > + tcp_get_rto_params(c);
> > +
> > tcp_sock_iov_init(c);
> >
> > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > diff --git a/tcp.h b/tcp.h
> > index 234a803..befedde 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref {
> > * @fwd_out: Port forwarding configuration for outbound packets
> > * @timer_run: Timestamp of most recent timer run
> > * @pipe_size: Size of pipes for spliced connections
> > + * @tcp_syn_retries: SYN retries using exponential backoff timeout
> > + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> > */
> > struct tcp_ctx {
> > struct fwd_ports fwd_in;
> > struct fwd_ports fwd_out;
> > struct timespec timer_run;
> > size_t pipe_size;
> > + uint8_t tcp_syn_retries;
>
> Why 'tcp' again?
>
> > + uint8_t syn_linear_timeouts;
> > };
> >
> > #endif /* TCP_H */
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-05 4:24 ` David Gibson
@ 2025-11-05 7:00 ` Stefano Brivio
2025-11-07 9:56 ` Yumei Huang
0 siblings, 1 reply; 34+ messages in thread
From: Stefano Brivio @ 2025-11-05 7:00 UTC (permalink / raw)
To: David Gibson; +Cc: Yumei Huang, passt-dev
On Wed, 5 Nov 2025 15:24:44 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
> > On Fri, 31 Oct 2025 13:42:40 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> [snip]
> > > +/**
> > > + * tcp_get_rto_params() - Get host kernel RTO parameters
> > > + * @c: Execution context
> > > +*/
> > > +void tcp_get_rto_params(struct ctx *c)
> > > +{
> > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > +
> > > + tcp_syn_retries = read_file_integer(
> > > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > > + syn_linear_timeouts = read_file_integer(
> > > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> >
> > I think this is a bit hard to read. Now:
> >
> > tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES,
> > TCP_SYN_RETRIES_DEFAULT);
> > syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS,
> > TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> >
> > would be a bit closer to the coding style we're adopting, but I wonder:
> >
> > - does read_file_integer() really need to have "integer" in the name?
>
> > In a language where integers are called 'int', perhaps
> > read_file_int() is clear enough?
>
> I think the idea is that read_file_integer() can be used for any
> (signed) integer type (with range checking performed after the call).
> read_file_int() might suggest it reads exactly an 'int', not anything
> bigger or smaller.
Oh, I see. It could be read_file_num() then, even if it's slightly less
accurate, or it can even remain read_file_integer(). If something like
this:
v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT);
c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
fits 80 columns, I'm taking it as a sign that the function name isn't
exceedingly long. No particular preference from me.
> > - the constants are defined here, in tcp.c, so they obviously refer to
> > TCP, so maybe SYN_LINEAR_TIMEOUTS is clear enough
> >
> > - you don't really need to store both values in two different
> > variables, one is enough as you're assigning them right away. And:
> >
> > v = read_file_int(SYN_RETRIES, SYN_RETRIES_DEFAULT);
> > c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
> >
> > v = read_file_int(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
> > c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
> >
> > is four lines instead of six, and more readable if you ask me.
> >
> > > +
> > > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > > +
> > > + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8,
> > > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > > +}
> > > +
> > > /**
> > > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > > * @c: Execution context
> > > @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c)
> > > {
> > > ASSERT(!c->no_tcp);
> > >
> > > + tcp_get_rto_params(c);
> > > +
> > > tcp_sock_iov_init(c);
> > >
> > > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > > diff --git a/tcp.h b/tcp.h
> > > index 234a803..befedde 100644
> > > --- a/tcp.h
> > > +++ b/tcp.h
> > > @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref {
> > > * @fwd_out: Port forwarding configuration for outbound packets
> > > * @timer_run: Timestamp of most recent timer run
> > > * @pipe_size: Size of pipes for spliced connections
> > > + * @tcp_syn_retries: SYN retries using exponential backoff timeout
> > > + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout
> > > */
> > > struct tcp_ctx {
> > > struct fwd_ports fwd_in;
> > > struct fwd_ports fwd_out;
> > > struct timespec timer_run;
> > > size_t pipe_size;
> > > + uint8_t tcp_syn_retries;
> >
> > Why 'tcp' again?
> >
> > > + uint8_t syn_linear_timeouts;
> > > };
> > >
> > > #endif /* TCP_H */
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-05 7:00 ` Stefano Brivio
@ 2025-11-07 9:56 ` Yumei Huang
2025-11-07 10:05 ` Stefano Brivio
0 siblings, 1 reply; 34+ messages in thread
From: Yumei Huang @ 2025-11-07 9:56 UTC (permalink / raw)
To: Stefano Brivio; +Cc: David Gibson, passt-dev, Laurent Vivier
On Wed, Nov 5, 2025 at 3:01 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Wed, 5 Nov 2025 15:24:44 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
> > > On Fri, 31 Oct 2025 13:42:40 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > [snip]
> > > > +/**
> > > > + * tcp_get_rto_params() - Get host kernel RTO parameters
> > > > + * @c: Execution context
> > > > +*/
> > > > +void tcp_get_rto_params(struct ctx *c)
> > > > +{
> > > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > > +
> > > > + tcp_syn_retries = read_file_integer(
> > > > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > > > + syn_linear_timeouts = read_file_integer(
> > > > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > >
> > > I think this is a bit hard to read. Now:
> > >
> > > tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES,
> > > TCP_SYN_RETRIES_DEFAULT);
> > > syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS,
> > > TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > >
> > > would be a bit closer to the coding style we're adopting, but I wonder:
> > >
> > > - does read_file_integer() really need to have "integer" in the name?
> >
> > > In a language where integers are called 'int', perhaps
> > > read_file_int() is clear enough?
> >
> > I think the idea is that read_file_integer() can be used for any
> > (signed) integer type (with range checking performed after the call).
> > read_file_int() might suggest it reads exactly an 'int', not anything
> > bigger or smaller.
>
> Oh, I see. It could be read_file_num() then, even if it's slightly less
> accurate, or it can even remain read_file_integer(). If something like
> this:
>
> v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT);
> c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
>
> v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
> c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
>
> fits 80 columns, I'm taking it as a sign that the function name isn't
> exceedingly long. No particular preference from me.
The longest line just exceeds 80 columns. I can rename the function to
read_file_num() if there is no objection from David.
Another thing, while I was revising the patches, I noticed the
parameter "const struct ctx *c" was removed from tcp_timer_ctl() by
commit dd5302dd7bf51 ("tcp, flow: Replace per-connection in_epoll flag
with an epollid in flow_common") from Laurent, but we need it to
access c->tcp.syn_retries and c->tcp.syn_linear_timeouts. Should we
add it back?
(Adding Laurent in the loop.)
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-07 9:56 ` Yumei Huang
@ 2025-11-07 10:05 ` Stefano Brivio
2025-11-10 2:52 ` Yumei Huang
0 siblings, 1 reply; 34+ messages in thread
From: Stefano Brivio @ 2025-11-07 10:05 UTC (permalink / raw)
To: Yumei Huang; +Cc: David Gibson, passt-dev, Laurent Vivier
On Fri, 7 Nov 2025 17:56:54 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Wed, Nov 5, 2025 at 3:01 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Wed, 5 Nov 2025 15:24:44 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
> > > > On Fri, 31 Oct 2025 13:42:40 +0800
> > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > [snip]
> > > > > +/**
> > > > > + * tcp_get_rto_params() - Get host kernel RTO parameters
> > > > > + * @c: Execution context
> > > > > +*/
> > > > > +void tcp_get_rto_params(struct ctx *c)
> > > > > +{
> > > > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > > > +
> > > > > + tcp_syn_retries = read_file_integer(
> > > > > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > > > > + syn_linear_timeouts = read_file_integer(
> > > > > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > >
> > > > I think this is a bit hard to read. Now:
> > > >
> > > > tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES,
> > > > TCP_SYN_RETRIES_DEFAULT);
> > > > syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS,
> > > > TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > >
> > > > would be a bit closer to the coding style we're adopting, but I wonder:
> > > >
> > > > - does read_file_integer() really need to have "integer" in the name?
> > >
> > > > In a language where integers are called 'int', perhaps
> > > > read_file_int() is clear enough?
> > >
> > > I think the idea is that read_file_integer() can be used for any
> > > (signed) integer type (with range checking performed after the call).
> > > read_file_int() might suggest it reads exactly an 'int', not anything
> > > bigger or smaller.
> >
> > Oh, I see. It could be read_file_num() then, even if it's slightly less
> > accurate, or it can even remain read_file_integer(). If something like
> > this:
> >
> > v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT);
> > c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
> >
> > v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
> > c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
> >
> > fits 80 columns, I'm taking it as a sign that the function name isn't
> > exceedingly long. No particular preference from me.
>
> The longest line just exceeds 80 columns.
No, why? It's 80 columns, look:
v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
01234567890123456789012345678901234567890123456789012345678901234567890123456789
> I can rename the function to read_file_num() if there is no objection from David.
Sure, up to you and David.
> Another thing, while I was revising the patches, I noticed the
> parameter "const struct ctx *c" was removed from tcp_timer_ctl() by
> commit dd5302dd7bf51 ("tcp, flow: Replace per-connection in_epoll flag
> with an epollid in flow_common") from Laurent, but we need it to
> access c->tcp.syn_retries and c->tcp.syn_linear_timeouts. Should we
> add it back?
Yes, I don't see any other way around it, and I think I'll need it back
anyway for some fixes I'm working on.
It's one type of trivial conflict I would typically fix up on merge, by
the way, but given that you will send another version, you could/should
rebase on latest HEAD anyway.
> (Adding Laurent in the loop.)
--
Stefano
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-07 10:05 ` Stefano Brivio
@ 2025-11-10 2:52 ` Yumei Huang
2025-11-10 4:25 ` David Gibson
0 siblings, 1 reply; 34+ messages in thread
From: Yumei Huang @ 2025-11-10 2:52 UTC (permalink / raw)
To: Stefano Brivio; +Cc: David Gibson, passt-dev, Laurent Vivier
On Fri, Nov 7, 2025 at 6:05 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Fri, 7 Nov 2025 17:56:54 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Wed, Nov 5, 2025 at 3:01 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Wed, 5 Nov 2025 15:24:44 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
> > > > > On Fri, 31 Oct 2025 13:42:40 +0800
> > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > [snip]
> > > > > > +/**
> > > > > > + * tcp_get_rto_params() - Get host kernel RTO parameters
> > > > > > + * @c: Execution context
> > > > > > +*/
> > > > > > +void tcp_get_rto_params(struct ctx *c)
> > > > > > +{
> > > > > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > > > > +
> > > > > > + tcp_syn_retries = read_file_integer(
> > > > > > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > > > > > + syn_linear_timeouts = read_file_integer(
> > > > > > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > > >
> > > > > I think this is a bit hard to read. Now:
> > > > >
> > > > > tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES,
> > > > > TCP_SYN_RETRIES_DEFAULT);
> > > > > syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS,
> > > > > TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > > >
> > > > > would be a bit closer to the coding style we're adopting, but I wonder:
> > > > >
> > > > > - does read_file_integer() really need to have "integer" in the name?
> > > >
> > > > > In a language where integers are called 'int', perhaps
> > > > > read_file_int() is clear enough?
> > > >
> > > > I think the idea is that read_file_integer() can be used for any
> > > > (signed) integer type (with range checking performed after the call).
> > > > read_file_int() might suggest it reads exactly an 'int', not anything
> > > > bigger or smaller.
> > >
> > > Oh, I see. It could be read_file_num() then, even if it's slightly less
> > > accurate, or it can even remain read_file_integer(). If something like
> > > this:
> > >
> > > v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT);
> > > c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
> > >
> > > v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
> > >
> > > fits 80 columns, I'm taking it as a sign that the function name isn't
> > > exceedingly long. No particular preference from me.
> >
> > The longest line just exceeds 80 columns.
>
> No, why? It's 80 columns, look:
>
> v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
> 01234567890123456789012345678901234567890123456789012345678901234567890123456789
You are right. Somehow I counted 81 columns. Then I will keep the name
as it is. (To me, read_file_num() sounds like it can read any kind of
number.)
>
> > I can rename the function to read_file_num() if there is no objection from David.
>
> Sure, up to you and David.
>
> > Another thing, while I was revising the patches, I noticed the
> > parameter "const struct ctx *c" was removed from tcp_timer_ctl() by
> > commit dd5302dd7bf51 ("tcp, flow: Replace per-connection in_epoll flag
> > with an epollid in flow_common") from Laurent, but we need it to
> > access c->tcp.syn_retries and c->tcp.syn_linear_timeouts. Should we
> > add it back?
>
> Yes, I don't see any other way around it, and I think I'll need it back
> anyway for some fixes I'm working on.
>
> It's one type of trivial conflict I would typically fix up on merge, by
> the way, but given that you will send another version, you could/should
> rebase on latest HEAD anyway.
Sure.
>
> > (Adding Laurent in the loop.)
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v7 3/5] tcp: Resend SYN for inbound connections
2025-11-10 2:52 ` Yumei Huang
@ 2025-11-10 4:25 ` David Gibson
0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2025-11-10 4:25 UTC (permalink / raw)
To: Yumei Huang; +Cc: Stefano Brivio, passt-dev, Laurent Vivier
[-- Attachment #1: Type: text/plain, Size: 3543 bytes --]
On Mon, Nov 10, 2025 at 10:52:05AM +0800, Yumei Huang wrote:
> On Fri, Nov 7, 2025 at 6:05 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Fri, 7 Nov 2025 17:56:54 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > On Wed, Nov 5, 2025 at 3:01 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > On Wed, 5 Nov 2025 15:24:44 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
> > > > > > On Fri, 31 Oct 2025 13:42:40 +0800
> > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > [snip]
> > > > > > > +/**
> > > > > > > + * tcp_get_rto_params() - Get host kernel RTO parameters
> > > > > > > + * @c: Execution context
> > > > > > > +*/
> > > > > > > +void tcp_get_rto_params(struct ctx *c)
> > > > > > > +{
> > > > > > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > > > > > +
> > > > > > > + tcp_syn_retries = read_file_integer(
> > > > > > > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT);
> > > > > > > + syn_linear_timeouts = read_file_integer(
> > > > > > > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > > > >
> > > > > > I think this is a bit hard to read. Now:
> > > > > >
> > > > > > tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES,
> > > > > > TCP_SYN_RETRIES_DEFAULT);
> > > > > > syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS,
> > > > > > TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > > > >
> > > > > > would be a bit closer to the coding style we're adopting, but I wonder:
> > > > > >
> > > > > > - does read_file_integer() really need to have "integer" in the name?
> > > > >
> > > > > > In a language where integers are called 'int', perhaps
> > > > > > read_file_int() is clear enough?
> > > > >
> > > > > I think the idea is that read_file_integer() can be used for any
> > > > > (signed) integer type (with range checking performed after the call).
> > > > > read_file_int() might suggest it reads exactly an 'int', not anything
> > > > > bigger or smaller.
> > > >
> > > > Oh, I see. It could be read_file_num() then, even if it's slightly less
> > > > accurate, or it can even remain read_file_integer(). If something like
> > > > this:
> > > >
> > > > v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT);
> > > > c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
> > > >
> > > > v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
> > > > c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
> > > >
> > > > fits 80 columns, I'm taking it as a sign that the function name isn't
> > > > exceedingly long. No particular preference from me.
> > >
> > > The longest line just exceeds 80 columns.
> >
> > No, why? It's 80 columns, look:
> >
> > v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT);
> > 01234567890123456789012345678901234567890123456789012345678901234567890123456789
>
> You are right. Somehow I counted 81 columns. Then I will keep the name
> as it is. (To me, read_file_num() sounds like it can read any kind of
> number.)
I also slightly prefer read_file_integer().
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2025-11-10 4:26 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-31 5:42 [PATCH v7 0/5] Retry SYNs for inbound connections Yumei Huang
2025-10-31 5:42 ` [PATCH v7 1/5] tcp: Rename "retrans" to "retries" Yumei Huang
2025-10-31 5:42 ` [PATCH v7 2/5] util: Introduce read_file() and read_file_integer() function Yumei Huang
2025-11-03 0:53 ` David Gibson
2025-10-31 5:42 ` [PATCH v7 3/5] tcp: Resend SYN for inbound connections Yumei Huang
2025-11-03 1:09 ` David Gibson
2025-11-03 2:31 ` Yumei Huang
2025-11-03 9:01 ` David Gibson
2025-11-04 4:42 ` Stefano Brivio
2025-11-04 4:42 ` Stefano Brivio
2025-11-05 4:24 ` David Gibson
2025-11-05 7:00 ` Stefano Brivio
2025-11-07 9:56 ` Yumei Huang
2025-11-07 10:05 ` Stefano Brivio
2025-11-10 2:52 ` Yumei Huang
2025-11-10 4:25 ` David Gibson
2025-10-31 5:42 ` [PATCH v7 4/5] tcp: Update data retransmission timeout Yumei Huang
2025-10-31 5:56 ` Yumei Huang
2025-10-31 8:04 ` Stefano Brivio
2025-11-03 1:18 ` David Gibson
2025-11-03 2:57 ` Yumei Huang
2025-11-03 9:32 ` David Gibson
2025-11-04 4:42 ` Stefano Brivio
2025-11-05 4:19 ` David Gibson
2025-10-31 5:42 ` [PATCH v7 5/5] tcp: Clamp the retry timeout Yumei Huang
2025-10-31 8:38 ` Stefano Brivio
2025-11-03 3:11 ` Yumei Huang
2025-11-03 9:37 ` David Gibson
2025-11-03 10:55 ` Stefano Brivio
2025-11-03 1:37 ` David Gibson
2025-11-03 4:06 ` Yumei Huang
2025-11-03 10:38 ` Stefano Brivio
2025-11-05 4:22 ` David Gibson
2025-11-04 4:42 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).