* [PATCH v6 0/4] Retry SYNs for inbound connections
@ 2025-10-17 6:28 Yumei Huang
2025-10-17 6:28 ` [PATCH v6 1/4] tcp: Rename "retrans" to "retries" Yumei Huang
` (3 more replies)
0 siblings, 4 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-17 6:28 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
When a client connects, SYN would be sent to guest only once. If the
guest is not connected or ready at that time, the connection will be
reset in 10s. These patches introduce the SYN retry mechanism using
the similar backoff timeout as linux kernel. Also update the data
retransmission timeout using the backoff timeout.
Yumei Huang (4):
tcp: Rename "retrans" to "retries"
util: Introduce read_file() and read_file_integer() function
tcp: Resend SYN for inbound connections
tcp: Update data retransmission timeout
tcp.c | 76 +++++++++++++++++++++++++++++++++++-------------
tcp.h | 5 ++++
tcp_conn.h | 12 ++++----
util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
util.h | 8 ++++++
5 files changed, 159 insertions(+), 26 deletions(-)
--
2.47.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v6 1/4] tcp: Rename "retrans" to "retries"
2025-10-17 6:28 [PATCH v6 0/4] Retry SYNs for inbound connections Yumei Huang
@ 2025-10-17 6:28 ` Yumei Huang
2025-10-17 6:28 ` [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function Yumei Huang
` (2 subsequent siblings)
3 siblings, 0 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-17 6:28 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
Rename "retrans" to "retries" so it can be used for SYN retries.
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 12 ++++++------
tcp_conn.h | 12 ++++++------
2 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/tcp.c b/tcp.c
index 0f9e9b3..2ec4b0c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -186,7 +186,7 @@
* - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
* data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
* socket and reset sequence to what was acknowledged. If this persists for
- * more than TCP_MAX_RETRANS times in a row, reset the connection
+ * more than TCP_MAX_RETRIES times in a row, reset the connection
*
* - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
* with TAP_FIN_SENT event), and no ACK is received within this time, reset
@@ -1127,7 +1127,7 @@ static void tcp_update_seqack_from_tap(const struct ctx *c,
if (SEQ_LT(seq, conn->seq_to_tap))
conn_flag(c, conn, ACK_FROM_TAP_DUE);
- conn->retrans = 0;
+ conn->retries = 0;
conn->seq_ack_from_tap = seq;
}
}
@@ -2414,7 +2414,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
flow_dbg(conn, "FIN timeout");
tcp_rst(c, conn);
- } else if (conn->retrans == TCP_MAX_RETRANS) {
+ } else if (conn->retries == TCP_MAX_RETRIES) {
flow_dbg(conn, "retransmissions count exceeded");
tcp_rst(c, conn);
} else {
@@ -2423,7 +2423,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
if (!conn->wnd_from_tap)
conn->wnd_from_tap = 1; /* Zero-window probe */
- conn->retrans++;
+ conn->retries++;
if (tcp_rewind_seq(c, conn))
return;
@@ -3382,7 +3382,7 @@ static int tcp_flow_repair_opt(const struct tcp_tap_conn *conn,
int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
{
struct tcp_tap_transfer t = {
- .retrans = conn->retrans,
+ .retries = conn->retries,
.ws_from_tap = conn->ws_from_tap,
.ws_to_tap = conn->ws_to_tap,
.events = conn->events,
@@ -3662,7 +3662,7 @@ int tcp_flow_migrate_target(struct ctx *c, int fd)
memcpy(&flow->f.side, &t.side, sizeof(flow->f.side));
conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
- conn->retrans = t.retrans;
+ conn->retries = t.retries;
conn->ws_from_tap = t.ws_from_tap;
conn->ws_to_tap = t.ws_to_tap;
conn->events = t.events;
diff --git a/tcp_conn.h b/tcp_conn.h
index 38b5c54..e5c8146 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -13,7 +13,7 @@
* struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
* @f: Generic flow information
* @in_epoll: Is the connection in the epoll set?
- * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT
+ * @retries: Number of retries occurred due to timeouts
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
@@ -38,9 +38,9 @@ struct tcp_tap_conn {
bool in_epoll :1;
-#define TCP_RETRANS_BITS 3
- unsigned int retrans :TCP_RETRANS_BITS;
-#define TCP_MAX_RETRANS MAX_FROM_BITS(TCP_RETRANS_BITS)
+#define TCP_RETRIES_BITS 3
+ unsigned int retries :TCP_RETRIES_BITS;
+#define TCP_MAX_RETRIES MAX_FROM_BITS(TCP_RETRIES_BITS)
#define TCP_WS_BITS 4 /* RFC 7323 */
#define TCP_WS_MAX 14
@@ -102,7 +102,7 @@ struct tcp_tap_conn {
* struct tcp_tap_transfer - Migrated TCP data, flow table part, network order
* @pif: Interfaces for each side of the flow
* @side: Addresses and ports for each side of the flow
- * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT
+ * @retries: Number of retries occurred due to timeouts
* @ws_from_tap: Window scaling factor advertised from tap/guest
* @ws_to_tap: Window scaling factor advertised to tap/guest
* @events: Connection events, implying connection states
@@ -122,7 +122,7 @@ struct tcp_tap_transfer {
uint8_t pif[SIDES];
struct flowside side[SIDES];
- uint8_t retrans;
+ uint8_t retries;
uint8_t ws_from_tap;
uint8_t ws_to_tap;
uint8_t events;
--
2.47.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-17 6:28 [PATCH v6 0/4] Retry SYNs for inbound connections Yumei Huang
2025-10-17 6:28 ` [PATCH v6 1/4] tcp: Rename "retrans" to "retries" Yumei Huang
@ 2025-10-17 6:28 ` Yumei Huang
2025-10-19 10:07 ` Stefano Brivio
2025-10-23 23:04 ` Stefano Brivio
2025-10-17 6:28 ` [PATCH v6 3/4] tcp: Resend SYN for inbound connections Yumei Huang
2025-10-17 6:28 ` [PATCH v6 4/4] tcp: Update data retransmission timeout Yumei Huang
3 siblings, 2 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-17 6:28 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
---
util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
util.h | 8 ++++++
2 files changed, 92 insertions(+)
diff --git a/util.c b/util.c
index c492f90..5c8c4bc 100644
--- a/util.c
+++ b/util.c
@@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
return len == 0 ? 0 : -1;
}
+/**
+ * read_file() - Read contents of file into a buffer
+ * @path: File to read
+ * @buf: Buffer to store file contents
+ * @buf_size: Size of buffer
+ *
+ * Return: number of bytes read on success, -1 on any error, -2 on truncation
+*/
+ssize_t read_file(const char *path, char *buf, size_t buf_size)
+{
+ int fd = open(path, O_RDONLY | O_CLOEXEC);
+ size_t total_read = 0;
+ ssize_t rc;
+
+ if (fd < 0) {
+ warn_perror("Could not open %s", path);
+ return -1;
+ }
+
+ while (total_read < buf_size) {
+ rc = read(fd, buf + total_read, buf_size - total_read);
+
+ if (rc < 0) {
+ warn_perror("Couldn't read from %s", path);
+ close(fd);
+ return -1;
+ }
+
+ if (rc == 0)
+ break;
+
+ total_read += rc;
+ }
+
+ close(fd);
+
+ if (total_read == buf_size) {
+ warn("File %s truncated, buffer too small", path);
+ return -2;
+ }
+
+ buf[total_read] = '\0';
+
+ return total_read;
+}
+
+/**
+ * read_file_integer() - Read an integer value from a file
+ * @path: File to read
+ * @fallback: Default value if file can't be read
+ *
+ * Return: Integer value, fallback on failure
+*/
+intmax_t read_file_integer(const char *path, intmax_t fallback)
+{
+ char buf[INTMAX_STRLEN];
+ ssize_t bytes_read;
+ intmax_t value;
+ char *end;
+
+ bytes_read = read_file(path, buf, sizeof(buf));
+
+ if (bytes_read < 0)
+ return fallback;
+
+ if (bytes_read == 0) {
+ debug("Empty file %s", path);
+ return fallback;
+ }
+
+ errno = 0;
+ value = strtoimax(buf, &end, 10);
+ if (*end && *end != '\n') {
+ debug("Invalid format in %s", path);
+ return fallback;
+ }
+ if (errno) {
+ debug("Invalid value in %s: %s", path, buf);
+ return fallback;
+ }
+
+ return value;
+}
+
#ifdef __ia64__
/* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
* use the description from clone(2).
diff --git a/util.h b/util.h
index 22eaac5..3f9f296 100644
--- a/util.h
+++ b/util.h
@@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
int __daemon(int pidfile_fd, int devnull_fd);
int fls(unsigned long x);
int write_file(const char *path, const char *buf);
+ssize_t read_file(const char *path, char *buf, size_t buf_size);
+intmax_t read_file_integer(const char *path, intmax_t fallback);
int write_all_buf(int fd, const void *buf, size_t len);
int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
int read_all_buf(int fd, void *buf, size_t len);
@@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
#define UINT16_STRLEN (sizeof("65535"))
+/* Each byte expands to at most 3 decimal digits since 0xff == 255.
+ * Plus 2 extra bytes for the sign and null terminator.
+ * See https://stackoverflow.com/a/10536254.
+ */
+#define INTMAX_STRLEN (sizeof(intmax_t) * 3 + 2)
+
/* inet address (- '\0') + port (u16) (- '\0') + ':' + '\0' */
#define SOCKADDR_INET_STRLEN \
(INET_ADDRSTRLEN-1 + UINT16_STRLEN-1 + sizeof(":"))
--
2.47.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-17 6:28 [PATCH v6 0/4] Retry SYNs for inbound connections Yumei Huang
2025-10-17 6:28 ` [PATCH v6 1/4] tcp: Rename "retrans" to "retries" Yumei Huang
2025-10-17 6:28 ` [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function Yumei Huang
@ 2025-10-17 6:28 ` Yumei Huang
2025-10-22 1:16 ` David Gibson
2025-10-23 23:04 ` Stefano Brivio
2025-10-17 6:28 ` [PATCH v6 4/4] tcp: Update data retransmission timeout Yumei Huang
3 siblings, 2 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-17 6:28 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
If a client connects while guest is not connected or ready yet,
resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
---
tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
tcp.h | 5 +++++
2 files changed, 52 insertions(+), 8 deletions(-)
diff --git a/tcp.c b/tcp.c
index 2ec4b0c..9385132 100644
--- a/tcp.c
+++ b/tcp.c
@@ -179,9 +179,11 @@
*
* Timeouts are implemented by means of timerfd timers, set based on flags:
*
- * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
- * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
- * connection
+ * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
+ * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
+ * SYN. It's the starting timeout for the first SYN retry. If this persists
+ * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
+ * tcp_syn_linear_timeouts) times in a row, reset the connection
*
* - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
* data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
@@ -340,7 +342,7 @@ enum {
#define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */
-#define SYN_TIMEOUT 10 /* s */
+#define SYN_TIMEOUT_INIT 1 /* s */
#define ACK_TIMEOUT 2
#define FIN_TIMEOUT 60
#define ACT_TIMEOUT 7200
@@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
+#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
+
/* "Extended" data (not stored in the flow table) for TCP flow migration */
static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
if (conn->flags & ACK_TO_TAP_DUE) {
it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
} else if (conn->flags & ACK_FROM_TAP_DUE) {
- if (!(conn->events & ESTABLISHED))
- it.it_value.tv_sec = SYN_TIMEOUT;
+ if (!(conn->events & ESTABLISHED)) {
+ if (conn->retries < c->tcp.syn_linear_timeouts)
+ it.it_value.tv_sec = SYN_TIMEOUT_INIT;
+ else
+ it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
+ (conn->retries - c->tcp.syn_linear_timeouts);
+ }
else
it.it_value.tv_sec = ACK_TIMEOUT;
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
@@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
tcp_timer_ctl(c, conn);
} else if (conn->flags & ACK_FROM_TAP_DUE) {
if (!(conn->events & ESTABLISHED)) {
- flow_dbg(conn, "handshake timeout");
- tcp_rst(c, conn);
+ if (conn->retries >= TCP_MAX_RETRIES ||
+ conn->retries >= (c->tcp.tcp_syn_retries +
+ c->tcp.syn_linear_timeouts)) {
+ flow_dbg(conn, "handshake timeout");
+ tcp_rst(c, conn);
+ } else {
+ flow_trace(conn, "SYN timeout, retry");
+ tcp_send_flag(c, conn, SYN);
+ conn->retries++;
+ tcp_timer_ctl(c, conn);
+ }
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
flow_dbg(conn, "FIN timeout");
tcp_rst(c, conn);
@@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
return sl;
}
+/**
+ * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
+ * @c: Execution context
+*/
+void tcp_syn_params_init(struct ctx *c)
+{
+ intmax_t tcp_syn_retries, syn_linear_timeouts;
+
+ tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
+ syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
+
+ c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
+ c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
+
+ debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
+}
+
/**
* tcp_init() - Get initial sequence, hash secret, initialise per-socket data
* @c: Execution context
@@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
{
ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c);
+
tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
diff --git a/tcp.h b/tcp.h
index 234a803..4369b52 100644
--- a/tcp.h
+++ b/tcp.h
@@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
* @fwd_out: Port forwarding configuration for outbound packets
* @timer_run: Timestamp of most recent timer run
* @pipe_size: Size of pipes for spliced connections
+ * @tcp_syn_retries: Number of SYN retries during handshake
+ * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
+ * before switching to exponential backoff timeout
*/
struct tcp_ctx {
struct fwd_ports fwd_in;
struct fwd_ports fwd_out;
struct timespec timer_run;
size_t pipe_size;
+ uint8_t tcp_syn_retries;
+ uint8_t syn_linear_timeouts;
};
#endif /* TCP_H */
--
2.47.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-17 6:28 [PATCH v6 0/4] Retry SYNs for inbound connections Yumei Huang
` (2 preceding siblings ...)
2025-10-17 6:28 ` [PATCH v6 3/4] tcp: Resend SYN for inbound connections Yumei Huang
@ 2025-10-17 6:28 ` Yumei Huang
2025-10-22 1:19 ` David Gibson
2025-10-23 23:04 ` Stefano Brivio
3 siblings, 2 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-17 6:28 UTC (permalink / raw)
To: passt-dev, sbrivio; +Cc: david, yuhuang
Use an exponential backoff timeout for data retransmission according
to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)
diff --git a/tcp.c b/tcp.c
index 9385132..dc0ec6c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -179,16 +179,14 @@
*
* Timeouts are implemented by means of timerfd timers, set based on flags:
*
- * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
- * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
- * SYN. It's the starting timeout for the first SYN retry. If this persists
- * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
- * tcp_syn_linear_timeouts) times in a row, reset the connection
- *
- * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
- * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
- * socket and reset sequence to what was acknowledged. If this persists for
- * more than TCP_MAX_RETRIES times in a row, reset the connection
+ * - RTO_INIT: if no ACK segment was received from tap/guest, either during
+ * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
+ * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
+ * from the socket and reset sequence to what was acknowledged. This is the
+ * timeout for the first retry, in seconds. If this persists too many times
+ * in a row, reset the connection: TCP_MAX_RETRIES for established
+ * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
+ * handshake.
*
* - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
* with TAP_FIN_SENT event), and no ACK is received within this time, reset
@@ -342,8 +340,7 @@ enum {
#define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */
-#define SYN_TIMEOUT_INIT 1 /* s */
-#define ACK_TIMEOUT 2
+#define RTO_INIT 1 /* s, RFC 6298 */
#define FIN_TIMEOUT 60
#define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
} else if (conn->flags & ACK_FROM_TAP_DUE) {
if (!(conn->events & ESTABLISHED)) {
if (conn->retries < c->tcp.syn_linear_timeouts)
- it.it_value.tv_sec = SYN_TIMEOUT_INIT;
+ it.it_value.tv_sec = RTO_INIT;
else
- it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
+ it.it_value.tv_sec = RTO_INIT <<
(conn->retries - c->tcp.syn_linear_timeouts);
}
else
- it.it_value.tv_sec = ACK_TIMEOUT;
+ it.it_value.tv_sec = RTO_INIT << conn->retries;
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
it.it_value.tv_sec = FIN_TIMEOUT;
} else {
--
2.47.0
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-17 6:28 ` [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function Yumei Huang
@ 2025-10-19 10:07 ` Stefano Brivio
2025-10-21 9:32 ` Yumei Huang
2025-10-23 23:04 ` Stefano Brivio
1 sibling, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-19 10:07 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Fri, 17 Oct 2025 14:28:36 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> ---
> util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> util.h | 8 ++++++
> 2 files changed, 92 insertions(+)
>
> diff --git a/util.c b/util.c
> index c492f90..5c8c4bc 100644
> --- a/util.c
> +++ b/util.c
> @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> return len == 0 ? 0 : -1;
> }
>
> +/**
> + * read_file() - Read contents of file into a buffer
> + * @path: File to read
> + * @buf: Buffer to store file contents
> + * @buf_size: Size of buffer
> + *
> + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> +*/
> +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> +{
> + int fd = open(path, O_RDONLY | O_CLOEXEC);
> + size_t total_read = 0;
> + ssize_t rc;
> +
> + if (fd < 0) {
> + warn_perror("Could not open %s", path);
> + return -1;
> + }
> +
> + while (total_read < buf_size) {
> + rc = read(fd, buf + total_read, buf_size - total_read);
> +
> + if (rc < 0) {
> + warn_perror("Couldn't read from %s", path);
> + close(fd);
> + return -1;
> + }
> +
> + if (rc == 0)
> + break;
> +
> + total_read += rc;
> + }
> +
> + close(fd);
> +
> + if (total_read == buf_size) {
> + warn("File %s truncated, buffer too small", path);
> + return -2;
> + }
> +
> + buf[total_read] = '\0';
> +
> + return total_read;
> +}
> +
> +/**
> + * read_file_integer() - Read an integer value from a file
> + * @path: File to read
> + * @fallback: Default value if file can't be read
> + *
> + * Return: Integer value, fallback on failure
> +*/
> +intmax_t read_file_integer(const char *path, intmax_t fallback)
> +{
> + char buf[INTMAX_STRLEN];
> + ssize_t bytes_read;
> + intmax_t value;
> + char *end;
> +
> + bytes_read = read_file(path, buf, sizeof(buf));
> +
> + if (bytes_read < 0)
> + return fallback;
> +
> + if (bytes_read == 0) {
> + debug("Empty file %s", path);
> + return fallback;
> + }
> +
> + errno = 0;
> + value = strtoimax(buf, &end, 10);
> + if (*end && *end != '\n') {
> + debug("Invalid format in %s", path);
> + return fallback;
> + }
> + if (errno) {
> + debug("Invalid value in %s: %s", path, buf);
> + return fallback;
> + }
> +
> + return value;
> +}
> +
> #ifdef __ia64__
> /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
> * use the description from clone(2).
> diff --git a/util.h b/util.h
> index 22eaac5..3f9f296 100644
> --- a/util.h
> +++ b/util.h
> @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
> int __daemon(int pidfile_fd, int devnull_fd);
> int fls(unsigned long x);
> int write_file(const char *path, const char *buf);
> +ssize_t read_file(const char *path, char *buf, size_t buf_size);
> +intmax_t read_file_integer(const char *path, intmax_t fallback);
> int write_all_buf(int fd, const void *buf, size_t len);
> int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> int read_all_buf(int fd, void *buf, size_t len);
> @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
>
> #define UINT16_STRLEN (sizeof("65535"))
>
> +/* Each byte expands to at most 3 decimal digits since 0xff == 255.
> + * Plus 2 extra bytes for the sign and null terminator.
> + * See https://stackoverflow.com/a/10536254.
This is not an acceptable form of attribution according to the
CC BY-SA 3.0 terms. See:
https://stackoverflow.com/help/licensing
https://creativecommons.org/licenses/by-sa/3.0/
and checksum.h in this tree for some examples of how to combine
different licensing terms in a single file, in a way that's
human-readable but still machine-friendly (for license / compliance
scanners such as REUSE).
As I commented on a previous version, anyway, I don't think we need
this at all. I guess my comment was ignored though.
Note that I didn't review this patch or the rest of this series, yet.
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-19 10:07 ` Stefano Brivio
@ 2025-10-21 9:32 ` Yumei Huang
2025-10-21 21:50 ` Stefano Brivio
0 siblings, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-21 9:32 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Fri, 17 Oct 2025 14:28:36 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > ---
> > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > util.h | 8 ++++++
> > 2 files changed, 92 insertions(+)
> >
> > diff --git a/util.c b/util.c
> > index c492f90..5c8c4bc 100644
> > --- a/util.c
> > +++ b/util.c
> > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > return len == 0 ? 0 : -1;
> > }
> >
> > +/**
> > + * read_file() - Read contents of file into a buffer
> > + * @path: File to read
> > + * @buf: Buffer to store file contents
> > + * @buf_size: Size of buffer
> > + *
> > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> > +*/
> > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > +{
> > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > + size_t total_read = 0;
> > + ssize_t rc;
> > +
> > + if (fd < 0) {
> > + warn_perror("Could not open %s", path);
> > + return -1;
> > + }
> > +
> > + while (total_read < buf_size) {
> > + rc = read(fd, buf + total_read, buf_size - total_read);
> > +
> > + if (rc < 0) {
> > + warn_perror("Couldn't read from %s", path);
> > + close(fd);
> > + return -1;
> > + }
> > +
> > + if (rc == 0)
> > + break;
> > +
> > + total_read += rc;
> > + }
> > +
> > + close(fd);
> > +
> > + if (total_read == buf_size) {
> > + warn("File %s truncated, buffer too small", path);
> > + return -2;
> > + }
> > +
> > + buf[total_read] = '\0';
> > +
> > + return total_read;
> > +}
> > +
> > +/**
> > + * read_file_integer() - Read an integer value from a file
> > + * @path: File to read
> > + * @fallback: Default value if file can't be read
> > + *
> > + * Return: Integer value, fallback on failure
> > +*/
> > +intmax_t read_file_integer(const char *path, intmax_t fallback)
> > +{
> > + char buf[INTMAX_STRLEN];
> > + ssize_t bytes_read;
> > + intmax_t value;
> > + char *end;
> > +
> > + bytes_read = read_file(path, buf, sizeof(buf));
> > +
> > + if (bytes_read < 0)
> > + return fallback;
> > +
> > + if (bytes_read == 0) {
> > + debug("Empty file %s", path);
> > + return fallback;
> > + }
> > +
> > + errno = 0;
> > + value = strtoimax(buf, &end, 10);
> > + if (*end && *end != '\n') {
> > + debug("Invalid format in %s", path);
> > + return fallback;
> > + }
> > + if (errno) {
> > + debug("Invalid value in %s: %s", path, buf);
> > + return fallback;
> > + }
> > +
> > + return value;
> > +}
> > +
> > #ifdef __ia64__
> > /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
> > * use the description from clone(2).
> > diff --git a/util.h b/util.h
> > index 22eaac5..3f9f296 100644
> > --- a/util.h
> > +++ b/util.h
> > @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
> > int __daemon(int pidfile_fd, int devnull_fd);
> > int fls(unsigned long x);
> > int write_file(const char *path, const char *buf);
> > +ssize_t read_file(const char *path, char *buf, size_t buf_size);
> > +intmax_t read_file_integer(const char *path, intmax_t fallback);
> > int write_all_buf(int fd, const void *buf, size_t len);
> > int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> > int read_all_buf(int fd, void *buf, size_t len);
> > @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
> >
> > #define UINT16_STRLEN (sizeof("65535"))
> >
> > +/* Each byte expands to at most 3 decimal digits since 0xff == 255.
> > + * Plus 2 extra bytes for the sign and null terminator.
> > + * See https://stackoverflow.com/a/10536254.
>
> This is not an acceptable form of attribution according to the
> CC BY-SA 3.0 terms. See:
>
> https://stackoverflow.com/help/licensing
> https://creativecommons.org/licenses/by-sa/3.0/
>
> and checksum.h in this tree for some examples of how to combine
> different licensing terms in a single file, in a way that's
> human-readable but still machine-friendly (for license / compliance
> scanners such as REUSE).
>
> As I commented on a previous version, anyway, I don't think we need
> this at all. I guess my comment was ignored though.
I guess you meant the comment of suggesting using BUFSIZ in V2? David
replied as quote:
"We could use BUFSIZ, but it's massive overkill for
reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if
intmax_t were 128-bit)."
It's my fault. I should have waited to see if you have more comments
before sending the new version.
>
> Note that I didn't review this patch or the rest of this series, yet.
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-21 9:32 ` Yumei Huang
@ 2025-10-21 21:50 ` Stefano Brivio
2025-10-22 0:51 ` David Gibson
2025-10-22 0:55 ` Yumei Huang
0 siblings, 2 replies; 44+ messages in thread
From: Stefano Brivio @ 2025-10-21 21:50 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Tue, 21 Oct 2025 17:32:58 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Fri, 17 Oct 2025 14:28:36 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > ---
> > > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > util.h | 8 ++++++
> > > 2 files changed, 92 insertions(+)
> > >
> > > diff --git a/util.c b/util.c
> > > index c492f90..5c8c4bc 100644
> > > --- a/util.c
> > > +++ b/util.c
> > > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > > return len == 0 ? 0 : -1;
> > > }
> > >
> > > +/**
> > > + * read_file() - Read contents of file into a buffer
> > > + * @path: File to read
> > > + * @buf: Buffer to store file contents
> > > + * @buf_size: Size of buffer
> > > + *
> > > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> > > +*/
> > > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > > +{
> > > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > > + size_t total_read = 0;
> > > + ssize_t rc;
> > > +
> > > + if (fd < 0) {
> > > + warn_perror("Could not open %s", path);
> > > + return -1;
> > > + }
> > > +
> > > + while (total_read < buf_size) {
> > > + rc = read(fd, buf + total_read, buf_size - total_read);
> > > +
> > > + if (rc < 0) {
> > > + warn_perror("Couldn't read from %s", path);
> > > + close(fd);
> > > + return -1;
> > > + }
> > > +
> > > + if (rc == 0)
> > > + break;
> > > +
> > > + total_read += rc;
> > > + }
> > > +
> > > + close(fd);
> > > +
> > > + if (total_read == buf_size) {
> > > + warn("File %s truncated, buffer too small", path);
> > > + return -2;
> > > + }
> > > +
> > > + buf[total_read] = '\0';
> > > +
> > > + return total_read;
> > > +}
> > > +
> > > +/**
> > > + * read_file_integer() - Read an integer value from a file
> > > + * @path: File to read
> > > + * @fallback: Default value if file can't be read
> > > + *
> > > + * Return: Integer value, fallback on failure
> > > +*/
> > > +intmax_t read_file_integer(const char *path, intmax_t fallback)
> > > +{
> > > + char buf[INTMAX_STRLEN];
> > > + ssize_t bytes_read;
> > > + intmax_t value;
> > > + char *end;
> > > +
> > > + bytes_read = read_file(path, buf, sizeof(buf));
> > > +
> > > + if (bytes_read < 0)
> > > + return fallback;
> > > +
> > > + if (bytes_read == 0) {
> > > + debug("Empty file %s", path);
> > > + return fallback;
> > > + }
> > > +
> > > + errno = 0;
> > > + value = strtoimax(buf, &end, 10);
> > > + if (*end && *end != '\n') {
> > > + debug("Invalid format in %s", path);
> > > + return fallback;
> > > + }
> > > + if (errno) {
> > > + debug("Invalid value in %s: %s", path, buf);
> > > + return fallback;
> > > + }
> > > +
> > > + return value;
> > > +}
> > > +
> > > #ifdef __ia64__
> > > /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
> > > * use the description from clone(2).
> > > diff --git a/util.h b/util.h
> > > index 22eaac5..3f9f296 100644
> > > --- a/util.h
> > > +++ b/util.h
> > > @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
> > > int __daemon(int pidfile_fd, int devnull_fd);
> > > int fls(unsigned long x);
> > > int write_file(const char *path, const char *buf);
> > > +ssize_t read_file(const char *path, char *buf, size_t buf_size);
> > > +intmax_t read_file_integer(const char *path, intmax_t fallback);
> > > int write_all_buf(int fd, const void *buf, size_t len);
> > > int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> > > int read_all_buf(int fd, void *buf, size_t len);
> > > @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
> > >
> > > #define UINT16_STRLEN (sizeof("65535"))
> > >
> > > +/* Each byte expands to at most 3 decimal digits since 0xff == 255.
> > > + * Plus 2 extra bytes for the sign and null terminator.
> > > + * See https://stackoverflow.com/a/10536254.
> >
> > This is not an acceptable form of attribution according to the
> > CC BY-SA 3.0 terms. See:
> >
> > https://stackoverflow.com/help/licensing
> > https://creativecommons.org/licenses/by-sa/3.0/
> >
> > and checksum.h in this tree for some examples of how to combine
> > different licensing terms in a single file, in a way that's
> > human-readable but still machine-friendly (for license / compliance
> > scanners such as REUSE).
> >
> > As I commented on a previous version, anyway, I don't think we need
> > this at all. I guess my comment was ignored though.
>
> I guess you meant the comment of suggesting using BUFSIZ in V2?
Right, but on v4, and that was just Friday for everybody involved...
> David replied as quote:
>
> "We could use BUFSIZ, but it's massive overkill for
> reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if
> intmax_t were 128-bit)."
I wanted to reply to that because sure, BUFSIZ is typically 8192 bytes
on glibc and 1024 with musl, but adding 10 or 8192 to the stack pointer
doesn't really make a difference.
It's not like we allocate that memory anyway, and I don't think any of
that memory (or unused holes on the stack we create) is prefetched. And
regardless of all that... we don't use these functions on any data
path, it's just during configuration. It can be (relatively) slow.
> It's my fault. I should have waited to see if you have more comments
> before sending the new version.
Thanks for understanding. My biggest problem with this is: where do I
comment now? On v4 or v6? Should I even read v5?
Well, anyway, I'll directly review v6 (in a bit) to keep things sane.
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-21 21:50 ` Stefano Brivio
@ 2025-10-22 0:51 ` David Gibson
2025-10-22 8:42 ` Yumei Huang
2025-10-22 0:55 ` Yumei Huang
1 sibling, 1 reply; 44+ messages in thread
From: David Gibson @ 2025-10-22 0:51 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yumei Huang, passt-dev
[-- Attachment #1: Type: text/plain, Size: 6811 bytes --]
On Tue, Oct 21, 2025 at 11:50:59PM +0200, Stefano Brivio wrote:
> On Tue, 21 Oct 2025 17:32:58 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Fri, 17 Oct 2025 14:28:36 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > ---
> > > > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > util.h | 8 ++++++
> > > > 2 files changed, 92 insertions(+)
> > > >
> > > > diff --git a/util.c b/util.c
> > > > index c492f90..5c8c4bc 100644
> > > > --- a/util.c
> > > > +++ b/util.c
> > > > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > > > return len == 0 ? 0 : -1;
> > > > }
> > > >
> > > > +/**
> > > > + * read_file() - Read contents of file into a buffer
> > > > + * @path: File to read
> > > > + * @buf: Buffer to store file contents
> > > > + * @buf_size: Size of buffer
> > > > + *
> > > > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> > > > +*/
> > > > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > > > +{
> > > > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > > > + size_t total_read = 0;
> > > > + ssize_t rc;
> > > > +
> > > > + if (fd < 0) {
> > > > + warn_perror("Could not open %s", path);
> > > > + return -1;
> > > > + }
> > > > +
> > > > + while (total_read < buf_size) {
> > > > + rc = read(fd, buf + total_read, buf_size - total_read);
> > > > +
> > > > + if (rc < 0) {
> > > > + warn_perror("Couldn't read from %s", path);
> > > > + close(fd);
> > > > + return -1;
> > > > + }
> > > > +
> > > > + if (rc == 0)
> > > > + break;
> > > > +
> > > > + total_read += rc;
> > > > + }
> > > > +
> > > > + close(fd);
> > > > +
> > > > + if (total_read == buf_size) {
> > > > + warn("File %s truncated, buffer too small", path);
> > > > + return -2;
> > > > + }
> > > > +
> > > > + buf[total_read] = '\0';
> > > > +
> > > > + return total_read;
> > > > +}
> > > > +
> > > > +/**
> > > > + * read_file_integer() - Read an integer value from a file
> > > > + * @path: File to read
> > > > + * @fallback: Default value if file can't be read
> > > > + *
> > > > + * Return: Integer value, fallback on failure
> > > > +*/
> > > > +intmax_t read_file_integer(const char *path, intmax_t fallback)
> > > > +{
> > > > + char buf[INTMAX_STRLEN];
> > > > + ssize_t bytes_read;
> > > > + intmax_t value;
> > > > + char *end;
> > > > +
> > > > + bytes_read = read_file(path, buf, sizeof(buf));
> > > > +
> > > > + if (bytes_read < 0)
> > > > + return fallback;
> > > > +
> > > > + if (bytes_read == 0) {
> > > > + debug("Empty file %s", path);
> > > > + return fallback;
> > > > + }
> > > > +
> > > > + errno = 0;
> > > > + value = strtoimax(buf, &end, 10);
> > > > + if (*end && *end != '\n') {
> > > > + debug("Invalid format in %s", path);
> > > > + return fallback;
> > > > + }
> > > > + if (errno) {
> > > > + debug("Invalid value in %s: %s", path, buf);
> > > > + return fallback;
> > > > + }
> > > > +
> > > > + return value;
> > > > +}
> > > > +
> > > > #ifdef __ia64__
> > > > /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
> > > > * use the description from clone(2).
> > > > diff --git a/util.h b/util.h
> > > > index 22eaac5..3f9f296 100644
> > > > --- a/util.h
> > > > +++ b/util.h
> > > > @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
> > > > int __daemon(int pidfile_fd, int devnull_fd);
> > > > int fls(unsigned long x);
> > > > int write_file(const char *path, const char *buf);
> > > > +ssize_t read_file(const char *path, char *buf, size_t buf_size);
> > > > +intmax_t read_file_integer(const char *path, intmax_t fallback);
> > > > int write_all_buf(int fd, const void *buf, size_t len);
> > > > int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> > > > int read_all_buf(int fd, void *buf, size_t len);
> > > > @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
> > > >
> > > > #define UINT16_STRLEN (sizeof("65535"))
> > > >
> > > > +/* Each byte expands to at most 3 decimal digits since 0xff == 255.
> > > > + * Plus 2 extra bytes for the sign and null terminator.
> > > > + * See https://stackoverflow.com/a/10536254.
> > >
> > > This is not an acceptable form of attribution according to the
> > > CC BY-SA 3.0 terms. See:
> > >
> > > https://stackoverflow.com/help/licensing
> > > https://creativecommons.org/licenses/by-sa/3.0/
> > >
> > > and checksum.h in this tree for some examples of how to combine
> > > different licensing terms in a single file, in a way that's
> > > human-readable but still machine-friendly (for license / compliance
> > > scanners such as REUSE).
> > >
> > > As I commented on a previous version, anyway, I don't think we need
> > > this at all. I guess my comment was ignored though.
> >
> > I guess you meant the comment of suggesting using BUFSIZ in V2?
>
> Right, but on v4, and that was just Friday for everybody involved...
>
> > David replied as quote:
> >
> > "We could use BUFSIZ, but it's massive overkill for
> > reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if
> > intmax_t were 128-bit)."
>
> I wanted to reply to that because sure, BUFSIZ is typically 8192 bytes
> on glibc and 1024 with musl, but adding 10 or 8192 to the stack pointer
> doesn't really make a difference.
>
> It's not like we allocate that memory anyway, and I don't think any of
> that memory (or unused holes on the stack we create) is prefetched. And
> regardless of all that... we don't use these functions on any data
> path, it's just during configuration. It can be (relatively) slow.
Right. Sorry, I didn't follow up yet, but after I wrote that, Stefano
convinced me that BUFSIZ is fine. I originally suggested avoiding the
arbitrary buffer largely because I thought having an INTMAX_STRLEN
constant might be useful for other reasons too, but honestly, not
very.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-21 21:50 ` Stefano Brivio
2025-10-22 0:51 ` David Gibson
@ 2025-10-22 0:55 ` Yumei Huang
1 sibling, 0 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-22 0:55 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Wed, Oct 22, 2025 at 5:51 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Tue, 21 Oct 2025 17:32:58 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Fri, 17 Oct 2025 14:28:36 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > ---
> > > > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > util.h | 8 ++++++
> > > > 2 files changed, 92 insertions(+)
> > > >
> > > > diff --git a/util.c b/util.c
> > > > index c492f90..5c8c4bc 100644
> > > > --- a/util.c
> > > > +++ b/util.c
> > > > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > > > return len == 0 ? 0 : -1;
> > > > }
> > > >
> > > > +/**
> > > > + * read_file() - Read contents of file into a buffer
> > > > + * @path: File to read
> > > > + * @buf: Buffer to store file contents
> > > > + * @buf_size: Size of buffer
> > > > + *
> > > > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> > > > +*/
> > > > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > > > +{
> > > > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > > > + size_t total_read = 0;
> > > > + ssize_t rc;
> > > > +
> > > > + if (fd < 0) {
> > > > + warn_perror("Could not open %s", path);
> > > > + return -1;
> > > > + }
> > > > +
> > > > + while (total_read < buf_size) {
> > > > + rc = read(fd, buf + total_read, buf_size - total_read);
> > > > +
> > > > + if (rc < 0) {
> > > > + warn_perror("Couldn't read from %s", path);
> > > > + close(fd);
> > > > + return -1;
> > > > + }
> > > > +
> > > > + if (rc == 0)
> > > > + break;
> > > > +
> > > > + total_read += rc;
> > > > + }
> > > > +
> > > > + close(fd);
> > > > +
> > > > + if (total_read == buf_size) {
> > > > + warn("File %s truncated, buffer too small", path);
> > > > + return -2;
> > > > + }
> > > > +
> > > > + buf[total_read] = '\0';
> > > > +
> > > > + return total_read;
> > > > +}
> > > > +
> > > > +/**
> > > > + * read_file_integer() - Read an integer value from a file
> > > > + * @path: File to read
> > > > + * @fallback: Default value if file can't be read
> > > > + *
> > > > + * Return: Integer value, fallback on failure
> > > > +*/
> > > > +intmax_t read_file_integer(const char *path, intmax_t fallback)
> > > > +{
> > > > + char buf[INTMAX_STRLEN];
> > > > + ssize_t bytes_read;
> > > > + intmax_t value;
> > > > + char *end;
> > > > +
> > > > + bytes_read = read_file(path, buf, sizeof(buf));
> > > > +
> > > > + if (bytes_read < 0)
> > > > + return fallback;
> > > > +
> > > > + if (bytes_read == 0) {
> > > > + debug("Empty file %s", path);
> > > > + return fallback;
> > > > + }
> > > > +
> > > > + errno = 0;
> > > > + value = strtoimax(buf, &end, 10);
> > > > + if (*end && *end != '\n') {
> > > > + debug("Invalid format in %s", path);
> > > > + return fallback;
> > > > + }
> > > > + if (errno) {
> > > > + debug("Invalid value in %s: %s", path, buf);
> > > > + return fallback;
> > > > + }
> > > > +
> > > > + return value;
> > > > +}
> > > > +
> > > > #ifdef __ia64__
> > > > /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
> > > > * use the description from clone(2).
> > > > diff --git a/util.h b/util.h
> > > > index 22eaac5..3f9f296 100644
> > > > --- a/util.h
> > > > +++ b/util.h
> > > > @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
> > > > int __daemon(int pidfile_fd, int devnull_fd);
> > > > int fls(unsigned long x);
> > > > int write_file(const char *path, const char *buf);
> > > > +ssize_t read_file(const char *path, char *buf, size_t buf_size);
> > > > +intmax_t read_file_integer(const char *path, intmax_t fallback);
> > > > int write_all_buf(int fd, const void *buf, size_t len);
> > > > int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> > > > int read_all_buf(int fd, void *buf, size_t len);
> > > > @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
> > > >
> > > > #define UINT16_STRLEN (sizeof("65535"))
> > > >
> > > > +/* Each byte expands to at most 3 decimal digits since 0xff == 255.
> > > > + * Plus 2 extra bytes for the sign and null terminator.
> > > > + * See https://stackoverflow.com/a/10536254.
> > >
> > > This is not an acceptable form of attribution according to the
> > > CC BY-SA 3.0 terms. See:
> > >
> > > https://stackoverflow.com/help/licensing
> > > https://creativecommons.org/licenses/by-sa/3.0/
> > >
> > > and checksum.h in this tree for some examples of how to combine
> > > different licensing terms in a single file, in a way that's
> > > human-readable but still machine-friendly (for license / compliance
> > > scanners such as REUSE).
> > >
> > > As I commented on a previous version, anyway, I don't think we need
> > > this at all. I guess my comment was ignored though.
> >
> > I guess you meant the comment of suggesting using BUFSIZ in V2?
>
> Right, but on v4, and that was just Friday for everybody involved...
>
> > David replied as quote:
> >
> > "We could use BUFSIZ, but it's massive overkill for
> > reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if
> > intmax_t were 128-bit)."
>
> I wanted to reply to that because sure, BUFSIZ is typically 8192 bytes
> on glibc and 1024 with musl, but adding 10 or 8192 to the stack pointer
> doesn't really make a difference.
>
> It's not like we allocate that memory anyway, and I don't think any of
> that memory (or unused holes on the stack we create) is prefetched. And
> regardless of all that... we don't use these functions on any data
> path, it's just during configuration. It can be (relatively) slow.
>
> > It's my fault. I should have waited to see if you have more comments
> > before sending the new version.
>
> Thanks for understanding. My biggest problem with this is: where do I
> comment now? On v4 or v6? Should I even read v5?
>
> Well, anyway, I'll directly review v6 (in a bit) to keep things sane.
Sure, please go with v6.
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-17 6:28 ` [PATCH v6 3/4] tcp: Resend SYN for inbound connections Yumei Huang
@ 2025-10-22 1:16 ` David Gibson
2025-10-22 1:30 ` Yumei Huang
2025-10-23 23:04 ` Stefano Brivio
1 sibling, 1 reply; 44+ messages in thread
From: David Gibson @ 2025-10-22 1:16 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 5921 bytes --]
On Fri, Oct 17, 2025 at 02:28:37PM +0800, Yumei Huang wrote:
> If a client connects while guest is not connected or ready yet,
> resend SYN instead of just resetting connection after 10 seconds.
>
> Use the same backoff calculation for the timeout as linux kernel.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=153
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
> tcp.h | 5 +++++
> 2 files changed, 52 insertions(+), 8 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 2ec4b0c..9385132 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -179,9 +179,11 @@
> *
> * Timeouts are implemented by means of timerfd timers, set based on flags:
> *
> - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> - * connection
> + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> + * SYN. It's the starting timeout for the first SYN retry. If this persists
> + * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> + * tcp_syn_linear_timeouts) times in a row, reset the connection
> *
> * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> @@ -340,7 +342,7 @@ enum {
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> #define ACK_INTERVAL 10 /* ms */
> -#define SYN_TIMEOUT 10 /* s */
> +#define SYN_TIMEOUT_INIT 1 /* s */
> #define ACK_TIMEOUT 2
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
> @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
>
> #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
>
> +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
> +
> /* "Extended" data (not stored in the flow table) for TCP flow migration */
> static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
>
> @@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - if (!(conn->events & ESTABLISHED))
> - it.it_value.tv_sec = SYN_TIMEOUT;
> + if (!(conn->events & ESTABLISHED)) {
> + if (conn->retries < c->tcp.syn_linear_timeouts)
> + it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> + else
> + it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> + (conn->retries - c->tcp.syn_linear_timeouts);
> + }
> else
> it.it_value.tv_sec = ACK_TIMEOUT;
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> tcp_timer_ctl(c, conn);
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> if (!(conn->events & ESTABLISHED)) {
> - flow_dbg(conn, "handshake timeout");
> - tcp_rst(c, conn);
> + if (conn->retries >= TCP_MAX_RETRIES ||
> + conn->retries >= (c->tcp.tcp_syn_retries +
> + c->tcp.syn_linear_timeouts)) {
> + flow_dbg(conn, "handshake timeout");
> + tcp_rst(c, conn);
> + } else {
> + flow_trace(conn, "SYN timeout, retry");
> + tcp_send_flag(c, conn, SYN);
> + conn->retries++;
> + tcp_timer_ctl(c, conn);
> + }
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> flow_dbg(conn, "FIN timeout");
> tcp_rst(c, conn);
> @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
> return sl;
> }
>
> +/**
> + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
> + * @c: Execution context
> +*/
> +void tcp_syn_params_init(struct ctx *c)
> +{
> + intmax_t tcp_syn_retries, syn_linear_timeouts;
> +
> + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
> + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
> +
> + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> +
> + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
> + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> +}
> +
> /**
> * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> * @c: Execution context
> @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
> {
> ASSERT(!c->no_tcp);
>
> + tcp_syn_params_init(c);
> +
> tcp_sock_iov_init(c);
>
> memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> diff --git a/tcp.h b/tcp.h
> index 234a803..4369b52 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
> * @fwd_out: Port forwarding configuration for outbound packets
> * @timer_run: Timestamp of most recent timer run
> * @pipe_size: Size of pipes for spliced connections
> + * @tcp_syn_retries: Number of SYN retries during handshake
> + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
> + * before switching to exponential backoff timeout
> */
> struct tcp_ctx {
> struct fwd_ports fwd_in;
> struct fwd_ports fwd_out;
> struct timespec timer_run;
> size_t pipe_size;
> + uint8_t tcp_syn_retries;
> + uint8_t syn_linear_timeouts;
> };
>
> #endif /* TCP_H */
> --
> 2.47.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-17 6:28 ` [PATCH v6 4/4] tcp: Update data retransmission timeout Yumei Huang
@ 2025-10-22 1:19 ` David Gibson
2025-10-22 8:40 ` Yumei Huang
2025-10-23 23:04 ` Stefano Brivio
1 sibling, 1 reply; 44+ messages in thread
From: David Gibson @ 2025-10-22 1:19 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 3630 bytes --]
On Fri, Oct 17, 2025 at 02:28:38PM +0800, Yumei Huang wrote:
> Use an exponential backoff timeout for data retransmission according
> to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> in Appendix A of RFC 6298.
>
> Also combine the macros defining the initial RTO for both SYN and ACK.
>
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Still LGTM. Stefano has more or less convinced me that clamping the
maximum backoff time is worthwhile, but that can easily be a later
patch rather than respinning.
> ---
> tcp.c | 27 ++++++++++++---------------
> 1 file changed, 12 insertions(+), 15 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 9385132..dc0ec6c 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -179,16 +179,14 @@
> *
> * Timeouts are implemented by means of timerfd timers, set based on flags:
> *
> - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> - * SYN. It's the starting timeout for the first SYN retry. If this persists
> - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> - * tcp_syn_linear_timeouts) times in a row, reset the connection
> - *
> - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> - * socket and reset sequence to what was acknowledged. If this persists for
> - * more than TCP_MAX_RETRIES times in a row, reset the connection
> + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> + * from the socket and reset sequence to what was acknowledged. This is the
> + * timeout for the first retry, in seconds. If this persists too many times
> + * in a row, reset the connection: TCP_MAX_RETRIES for established
> + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> + * handshake.
> *
> * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> @@ -342,8 +340,7 @@ enum {
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> #define ACK_INTERVAL 10 /* ms */
> -#define SYN_TIMEOUT_INIT 1 /* s */
> -#define ACK_TIMEOUT 2
> +#define RTO_INIT 1 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
>
> @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> if (!(conn->events & ESTABLISHED)) {
> if (conn->retries < c->tcp.syn_linear_timeouts)
> - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> + it.it_value.tv_sec = RTO_INIT;
> else
> - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> + it.it_value.tv_sec = RTO_INIT <<
> (conn->retries - c->tcp.syn_linear_timeouts);
> }
> else
> - it.it_value.tv_sec = ACK_TIMEOUT;
> + it.it_value.tv_sec = RTO_INIT << conn->retries;
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> it.it_value.tv_sec = FIN_TIMEOUT;
> } else {
> --
> 2.47.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-22 1:16 ` David Gibson
@ 2025-10-22 1:30 ` Yumei Huang
2025-10-22 2:26 ` David Gibson
0 siblings, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-22 1:30 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, sbrivio
On Wed, Oct 22, 2025 at 9:17 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Fri, Oct 17, 2025 at 02:28:37PM +0800, Yumei Huang wrote:
> > If a client connects while guest is not connected or ready yet,
> > resend SYN instead of just resetting connection after 10 seconds.
> >
> > Use the same backoff calculation for the timeout as linux kernel.
> >
> > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I guess I need to update the doc string for tcp_syn_retries of the
struct tcp_ctx, as discussed in v5.
Sorry for the inconvenience brought by the multiple versions in a short time.
>
> > ---
> > tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
> > tcp.h | 5 +++++
> > 2 files changed, 52 insertions(+), 8 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 2ec4b0c..9385132 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -179,9 +179,11 @@
> > *
> > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > *
> > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > - * connection
> > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > + * SYN. It's the starting timeout for the first SYN retry. If this persists
> > + * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > + * tcp_syn_linear_timeouts) times in a row, reset the connection
> > *
> > * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > @@ -340,7 +342,7 @@ enum {
> > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> >
> > #define ACK_INTERVAL 10 /* ms */
> > -#define SYN_TIMEOUT 10 /* s */
> > +#define SYN_TIMEOUT_INIT 1 /* s */
> > #define ACK_TIMEOUT 2
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> > @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> >
> > #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
> >
> > +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
> > +
> > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> >
> > @@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > - if (!(conn->events & ESTABLISHED))
> > - it.it_value.tv_sec = SYN_TIMEOUT;
> > + if (!(conn->events & ESTABLISHED)) {
> > + if (conn->retries < c->tcp.syn_linear_timeouts)
> > + it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > + else
> > + it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > + (conn->retries - c->tcp.syn_linear_timeouts);
> > + }
> > else
> > it.it_value.tv_sec = ACK_TIMEOUT;
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > tcp_timer_ctl(c, conn);
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > if (!(conn->events & ESTABLISHED)) {
> > - flow_dbg(conn, "handshake timeout");
> > - tcp_rst(c, conn);
> > + if (conn->retries >= TCP_MAX_RETRIES ||
> > + conn->retries >= (c->tcp.tcp_syn_retries +
> > + c->tcp.syn_linear_timeouts)) {
> > + flow_dbg(conn, "handshake timeout");
> > + tcp_rst(c, conn);
> > + } else {
> > + flow_trace(conn, "SYN timeout, retry");
> > + tcp_send_flag(c, conn, SYN);
> > + conn->retries++;
> > + tcp_timer_ctl(c, conn);
> > + }
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > flow_dbg(conn, "FIN timeout");
> > tcp_rst(c, conn);
> > @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > return sl;
> > }
> >
> > +/**
> > + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
> > + * @c: Execution context
> > +*/
> > +void tcp_syn_params_init(struct ctx *c)
> > +{
> > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > +
> > + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
> > + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
> > +
> > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > +
> > + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
> > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > +}
> > +
> > /**
> > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > * @c: Execution context
> > @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
> > {
> > ASSERT(!c->no_tcp);
> >
> > + tcp_syn_params_init(c);
> > +
> > tcp_sock_iov_init(c);
> >
> > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > diff --git a/tcp.h b/tcp.h
> > index 234a803..4369b52 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
> > * @fwd_out: Port forwarding configuration for outbound packets
> > * @timer_run: Timestamp of most recent timer run
> > * @pipe_size: Size of pipes for spliced connections
> > + * @tcp_syn_retries: Number of SYN retries during handshake
> > + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
> > + * before switching to exponential backoff timeout
> > */
> > struct tcp_ctx {
> > struct fwd_ports fwd_in;
> > struct fwd_ports fwd_out;
> > struct timespec timer_run;
> > size_t pipe_size;
> > + uint8_t tcp_syn_retries;
> > + uint8_t syn_linear_timeouts;
> > };
> >
> > #endif /* TCP_H */
> > --
> > 2.47.0
> >
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-22 1:30 ` Yumei Huang
@ 2025-10-22 2:26 ` David Gibson
0 siblings, 0 replies; 44+ messages in thread
From: David Gibson @ 2025-10-22 2:26 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, sbrivio
[-- Attachment #1: Type: text/plain, Size: 1116 bytes --]
On Wed, Oct 22, 2025 at 09:30:55AM +0800, Yumei Huang wrote:
> On Wed, Oct 22, 2025 at 9:17 AM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > On Fri, Oct 17, 2025 at 02:28:37PM +0800, Yumei Huang wrote:
> > > If a client connects while guest is not connected or ready yet,
> > > resend SYN instead of just resetting connection after 10 seconds.
> > >
> > > Use the same backoff calculation for the timeout as linux kernel.
> > >
> > > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> >
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> I guess I need to update the doc string for tcp_syn_retries of the
> struct tcp_ctx, as discussed in v5.
True
> Sorry for the inconvenience brought by the multiple versions in a
> short time.
It's ok, taking a while to figure out the culture of a project is
expected.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-22 1:19 ` David Gibson
@ 2025-10-22 8:40 ` Yumei Huang
0 siblings, 0 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-22 8:40 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, sbrivio
On Wed, Oct 22, 2025 at 9:19 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Fri, Oct 17, 2025 at 02:28:38PM +0800, Yumei Huang wrote:
> > Use an exponential backoff timeout for data retransmission according
> > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > in Appendix A of RFC 6298.
> >
> > Also combine the macros defining the initial RTO for both SYN and ACK.
> >
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> Still LGTM. Stefano has more or less convinced me that clamping the
> maximum backoff time is worthwhile, but that can easily be a later
> patch rather than respinning.
Thank you. Either way, I need to respin for the doc string thing. Once
Stefano finishes the review for v6, I can send v7 with clamping the
maximum timeout.
>
> > ---
> > tcp.c | 27 ++++++++++++---------------
> > 1 file changed, 12 insertions(+), 15 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 9385132..dc0ec6c 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -179,16 +179,14 @@
> > *
> > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > *
> > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > - *
> > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > - * socket and reset sequence to what was acknowledged. If this persists for
> > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > + * from the socket and reset sequence to what was acknowledged. This is the
> > + * timeout for the first retry, in seconds. If this persists too many times
> > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > + * handshake.
> > *
> > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > @@ -342,8 +340,7 @@ enum {
> > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> >
> > #define ACK_INTERVAL 10 /* ms */
> > -#define SYN_TIMEOUT_INIT 1 /* s */
> > -#define ACK_TIMEOUT 2
> > +#define RTO_INIT 1 /* s, RFC 6298 */
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> >
> > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > if (!(conn->events & ESTABLISHED)) {
> > if (conn->retries < c->tcp.syn_linear_timeouts)
> > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > + it.it_value.tv_sec = RTO_INIT;
> > else
> > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > + it.it_value.tv_sec = RTO_INIT <<
> > (conn->retries - c->tcp.syn_linear_timeouts);
> > }
> > else
> > - it.it_value.tv_sec = ACK_TIMEOUT;
> > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > it.it_value.tv_sec = FIN_TIMEOUT;
> > } else {
> > --
> > 2.47.0
> >
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-22 0:51 ` David Gibson
@ 2025-10-22 8:42 ` Yumei Huang
0 siblings, 0 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-22 8:42 UTC (permalink / raw)
To: David Gibson; +Cc: Stefano Brivio, passt-dev
On Wed, Oct 22, 2025 at 9:01 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Tue, Oct 21, 2025 at 11:50:59PM +0200, Stefano Brivio wrote:
> > On Tue, 21 Oct 2025 17:32:58 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > On Fri, 17 Oct 2025 14:28:36 +0800
> > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > >
> > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > ---
> > > > > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > util.h | 8 ++++++
> > > > > 2 files changed, 92 insertions(+)
> > > > >
> > > > > diff --git a/util.c b/util.c
> > > > > index c492f90..5c8c4bc 100644
> > > > > --- a/util.c
> > > > > +++ b/util.c
> > > > > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > > > > return len == 0 ? 0 : -1;
> > > > > }
> > > > >
> > > > > +/**
> > > > > + * read_file() - Read contents of file into a buffer
> > > > > + * @path: File to read
> > > > > + * @buf: Buffer to store file contents
> > > > > + * @buf_size: Size of buffer
> > > > > + *
> > > > > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> > > > > +*/
> > > > > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > > > > +{
> > > > > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > > > > + size_t total_read = 0;
> > > > > + ssize_t rc;
> > > > > +
> > > > > + if (fd < 0) {
> > > > > + warn_perror("Could not open %s", path);
> > > > > + return -1;
> > > > > + }
> > > > > +
> > > > > + while (total_read < buf_size) {
> > > > > + rc = read(fd, buf + total_read, buf_size - total_read);
> > > > > +
> > > > > + if (rc < 0) {
> > > > > + warn_perror("Couldn't read from %s", path);
> > > > > + close(fd);
> > > > > + return -1;
> > > > > + }
> > > > > +
> > > > > + if (rc == 0)
> > > > > + break;
> > > > > +
> > > > > + total_read += rc;
> > > > > + }
> > > > > +
> > > > > + close(fd);
> > > > > +
> > > > > + if (total_read == buf_size) {
> > > > > + warn("File %s truncated, buffer too small", path);
> > > > > + return -2;
> > > > > + }
> > > > > +
> > > > > + buf[total_read] = '\0';
> > > > > +
> > > > > + return total_read;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * read_file_integer() - Read an integer value from a file
> > > > > + * @path: File to read
> > > > > + * @fallback: Default value if file can't be read
> > > > > + *
> > > > > + * Return: Integer value, fallback on failure
> > > > > +*/
> > > > > +intmax_t read_file_integer(const char *path, intmax_t fallback)
> > > > > +{
> > > > > + char buf[INTMAX_STRLEN];
> > > > > + ssize_t bytes_read;
> > > > > + intmax_t value;
> > > > > + char *end;
> > > > > +
> > > > > + bytes_read = read_file(path, buf, sizeof(buf));
> > > > > +
> > > > > + if (bytes_read < 0)
> > > > > + return fallback;
> > > > > +
> > > > > + if (bytes_read == 0) {
> > > > > + debug("Empty file %s", path);
> > > > > + return fallback;
> > > > > + }
> > > > > +
> > > > > + errno = 0;
> > > > > + value = strtoimax(buf, &end, 10);
> > > > > + if (*end && *end != '\n') {
> > > > > + debug("Invalid format in %s", path);
> > > > > + return fallback;
> > > > > + }
> > > > > + if (errno) {
> > > > > + debug("Invalid value in %s: %s", path, buf);
> > > > > + return fallback;
> > > > > + }
> > > > > +
> > > > > + return value;
> > > > > +}
> > > > > +
> > > > > #ifdef __ia64__
> > > > > /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
> > > > > * use the description from clone(2).
> > > > > diff --git a/util.h b/util.h
> > > > > index 22eaac5..3f9f296 100644
> > > > > --- a/util.h
> > > > > +++ b/util.h
> > > > > @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
> > > > > int __daemon(int pidfile_fd, int devnull_fd);
> > > > > int fls(unsigned long x);
> > > > > int write_file(const char *path, const char *buf);
> > > > > +ssize_t read_file(const char *path, char *buf, size_t buf_size);
> > > > > +intmax_t read_file_integer(const char *path, intmax_t fallback);
> > > > > int write_all_buf(int fd, const void *buf, size_t len);
> > > > > int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> > > > > int read_all_buf(int fd, void *buf, size_t len);
> > > > > @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
> > > > >
> > > > > #define UINT16_STRLEN (sizeof("65535"))
> > > > >
> > > > > +/* Each byte expands to at most 3 decimal digits since 0xff == 255.
> > > > > + * Plus 2 extra bytes for the sign and null terminator.
> > > > > + * See https://stackoverflow.com/a/10536254.
> > > >
> > > > This is not an acceptable form of attribution according to the
> > > > CC BY-SA 3.0 terms. See:
> > > >
> > > > https://stackoverflow.com/help/licensing
> > > > https://creativecommons.org/licenses/by-sa/3.0/
> > > >
> > > > and checksum.h in this tree for some examples of how to combine
> > > > different licensing terms in a single file, in a way that's
> > > > human-readable but still machine-friendly (for license / compliance
> > > > scanners such as REUSE).
> > > >
> > > > As I commented on a previous version, anyway, I don't think we need
> > > > this at all. I guess my comment was ignored though.
> > >
> > > I guess you meant the comment of suggesting using BUFSIZ in V2?
> >
> > Right, but on v4, and that was just Friday for everybody involved...
> >
> > > David replied as quote:
> > >
> > > "We could use BUFSIZ, but it's massive overkill for
> > > reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if
> > > intmax_t were 128-bit)."
> >
> > I wanted to reply to that because sure, BUFSIZ is typically 8192 bytes
> > on glibc and 1024 with musl, but adding 10 or 8192 to the stack pointer
> > doesn't really make a difference.
> >
> > It's not like we allocate that memory anyway, and I don't think any of
> > that memory (or unused holes on the stack we create) is prefetched. And
> > regardless of all that... we don't use these functions on any data
> > path, it's just during configuration. It can be (relatively) slow.
>
> Right. Sorry, I didn't follow up yet, but after I wrote that, Stefano
> convinced me that BUFSIZ is fine. I originally suggested avoiding the
> arbitrary buffer largely because I thought having an INTMAX_STRLEN
> constant might be useful for other reasons too, but honestly, not
> very.
I will update that in v7.
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-17 6:28 ` [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function Yumei Huang
2025-10-19 10:07 ` Stefano Brivio
@ 2025-10-23 23:04 ` Stefano Brivio
2025-10-24 3:16 ` David Gibson
1 sibling, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-23 23:04 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
Sorry for the delay, mostly nits but a couple of substantial comments:
On Fri, 17 Oct 2025 14:28:36 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> ---
> util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> util.h | 8 ++++++
> 2 files changed, 92 insertions(+)
>
> diff --git a/util.c b/util.c
> index c492f90..5c8c4bc 100644
> --- a/util.c
> +++ b/util.c
> @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> return len == 0 ? 0 : -1;
> }
>
> +/**
> + * read_file() - Read contents of file into a buffer
> + * @path: File to read
I see this is the same as write_file(), so in some sense it's
pre-existing, but @path isn't really a "file" in the sense that it's
not a file descriptor as one might expect from the description alone.
I'd rather say "Path to file" or "Path to file to read" or something
like that. On the other hand, if you want to keep this consistent with
write_file(), never mind. Not a strong preference from me.
> + * @buf: Buffer to store file contents
> + * @buf_size: Size of buffer
> + *
> + * Return: number of bytes read on success, -1 on any error, -2 on truncation
Similar comment here: this is partially symmetric to read_file, but
it's yet another convention we are introducing, because of the -2
special value.
Other somewhat related functions in util.c return with a meaningful
errno value set, this one doesn't.
The majority of helpers in passt, though, return with a negative
errno-like value, and truncation can be very well represented by
returning -ENOBUFS, see snprintf_check(). I think that's preferable.
Again, if the intention is to make this consistent to write_file(), it
can be left as it is.
> +*/
> +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> +{
> + int fd = open(path, O_RDONLY | O_CLOEXEC);
> + size_t total_read = 0;
> + ssize_t rc;
> +
> + if (fd < 0) {
> + warn_perror("Could not open %s", path);
> + return -1;
> + }
> +
> + while (total_read < buf_size) {
> + rc = read(fd, buf + total_read, buf_size - total_read);
> +
> + if (rc < 0) {
> + warn_perror("Couldn't read from %s", path);
> + close(fd);
> + return -1;
> + }
> +
> + if (rc == 0)
> + break;
> +
> + total_read += rc;
Coverity Scan (I can provide instructions separately if desired)
reports one issue below, but I'll mention it here for clarity: you are
adding 'rc', of type ssize_t, to total_read, of type size_t, and
buf_size is also of type size_t, so you could overflow total_read by
adding for example the maximum value for ssize_t twice, to it.
We can't run into the (theoretical) issue fixed by d836d9e34586 ("util:
Remove possible quadratic behaviour from write_remainder()") but the
solution here might be similar.
In general we should make sure that rc is less than whatever value we
might sum to total_read to make it overflow at any point in time.
I didn't really check this in detail, I can do that if needed, and
perhaps David remembers more clearly what we did in a similar
situation. It might also be a false positive, by the way.
---
/home/sbrivio/passt/util.c:625:2:
Type: Overflowed return value (INTEGER_OVERFLOW)
/home/sbrivio/passt/util.c:596:2:
1. path: Condition "fd < 0", taking false branch.
/home/sbrivio/passt/util.c:601:2:
2. path: Condition "total_read < buf_size", taking true branch.
/home/sbrivio/passt/util.c:604:3:
3. path: Condition "rc < 0", taking false branch.
/home/sbrivio/passt/util.c:610:3:
4. path: Condition "rc == 0", taking false branch.
/home/sbrivio/passt/util.c:614:2:
5. path: Jumping back to the beginning of the loop.
/home/sbrivio/passt/util.c:601:2:
6. path: Condition "total_read < buf_size", taking true branch.
/home/sbrivio/passt/util.c:602:3:
7. tainted_data_return: Called function "read(fd, buf + total_read, buf_size - total_read)", and a possible return value may be less than zero.
/home/sbrivio/passt/util.c:602:3:
8. assign: Assigning: "rc" = "read(fd, buf + total_read, buf_size - total_read)".
/home/sbrivio/passt/util.c:604:3:
9. path: Condition "rc < 0", taking false branch.
/home/sbrivio/passt/util.c:610:3:
10. path: Condition "rc == 0", taking false branch.
/home/sbrivio/passt/util.c:613:3:
11. overflow: The expression "total_read" is considered to have possibly overflowed.
/home/sbrivio/passt/util.c:614:2:
12. path: Jumping back to the beginning of the loop.
/home/sbrivio/passt/util.c:601:2:
13. path: Condition "total_read < buf_size", taking true branch.
/home/sbrivio/passt/util.c:604:3:
14. path: Condition "rc < 0", taking false branch.
/home/sbrivio/passt/util.c:610:3:
15. path: Condition "rc == 0", taking true branch.
/home/sbrivio/passt/util.c:611:4:
16. path: Breaking from loop.
/home/sbrivio/passt/util.c:618:2:
17. path: Condition "total_read == buf_size", taking false branch.
/home/sbrivio/passt/util.c:625:2:
18. return_overflow: "total_read", which might have overflowed, is returned from the function.
---
> + }
> +
> + close(fd);
> +
> + if (total_read == buf_size) {
> + warn("File %s truncated, buffer too small", path);
The file wasn't truncated (on disk) as this comment might seem to
indicate. I'd rather say "File contents exceed buffer size", or
"Partial file read", something like that.
While at it, you could print the size we read (it's %zu, see similar
examples where we print size_t types).
> + return -2;
Safer to NULL-terminate also in this case, perhaps? A future caller
might handle -2 (or equivalent) as a "partial" failure and use the
buffer anyway, so not NULL-terminating it is rather subtle.
> + }
> +
> + buf[total_read] = '\0';
This is not in the comment to the function ("Read contents of file into
NULL-terminated buffer"?).
> +
> + return total_read;
> +}
> +
> +/**
> + * read_file_integer() - Read an integer value from a file
> + * @path: File to read
> + * @fallback: Default value if file can't be read
These need to be aligned in the usual way (tabs).
> + *
> + * Return: Integer value, fallback on failure
That would be @fallback if you mean 'fallback' the argument.
See also: 9e0423e13541 ("style: Fix 'Return' comment style").
> +*/
> +intmax_t read_file_integer(const char *path, intmax_t fallback)
> +{
> + char buf[INTMAX_STRLEN];
> + ssize_t bytes_read;
> + intmax_t value;
> + char *end;
> +
> + bytes_read = read_file(path, buf, sizeof(buf));
> +
> + if (bytes_read < 0)
> + return fallback;
> +
> + if (bytes_read == 0) {
> + debug("Empty file %s", path);
> + return fallback;
> + }
> +
> + errno = 0;
> + value = strtoimax(buf, &end, 10);
> + if (*end && *end != '\n') {
> + debug("Invalid format in %s", path);
It took me a bit to understand that you mean that the file contains
non-numbers except for '-'. Perhaps mention that explicitly, otherwise
it looks like the file needs to have a somewhat special format? Say,
"Non-numeric content in %s".
> + return fallback;
> + }
> + if (errno) {
> + debug("Invalid value in %s: %s", path, buf);
I just learnt from strtoimax(3) that this can only be an underflow or
overflow. Maybe mention that ("Out of range value in..."), instead of
"invalid", which is kind of arbitrary?
> + return fallback;
> + }
> +
> + return value;
> +}
> +
> #ifdef __ia64__
> /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(),
> * use the description from clone(2).
> diff --git a/util.h b/util.h
> index 22eaac5..3f9f296 100644
> --- a/util.h
> +++ b/util.h
> @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid);
> int __daemon(int pidfile_fd, int devnull_fd);
> int fls(unsigned long x);
> int write_file(const char *path, const char *buf);
> +ssize_t read_file(const char *path, char *buf, size_t buf_size);
> +intmax_t read_file_integer(const char *path, intmax_t fallback);
> int write_all_buf(int fd, const void *buf, size_t len);
> int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> int read_all_buf(int fd, void *buf, size_t len);
> @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
>
> #define UINT16_STRLEN (sizeof("65535"))
>
> +/* Each byte expands to at most 3 decimal digits since 0xff == 255.
> + * Plus 2 extra bytes for the sign and null terminator.
> + * See https://stackoverflow.com/a/10536254.
> + */
> +#define INTMAX_STRLEN (sizeof(intmax_t) * 3 + 2)
> +
As discussed, we can probably use BUFSIZ instead, and keep this simple.
> /* inet address (- '\0') + port (u16) (- '\0') + ':' + '\0' */
> #define SOCKADDR_INET_STRLEN \
> (INET_ADDRSTRLEN-1 + UINT16_STRLEN-1 + sizeof(":"))
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-17 6:28 ` [PATCH v6 3/4] tcp: Resend SYN for inbound connections Yumei Huang
2025-10-22 1:16 ` David Gibson
@ 2025-10-23 23:04 ` Stefano Brivio
2025-10-24 3:30 ` David Gibson
2025-10-28 7:43 ` Yumei Huang
1 sibling, 2 replies; 44+ messages in thread
From: Stefano Brivio @ 2025-10-23 23:04 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Fri, 17 Oct 2025 14:28:37 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> If a client connects while guest is not connected or ready yet,
> resend SYN instead of just resetting connection after 10 seconds.
>
> Use the same backoff calculation for the timeout as linux kernel.
Linux.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=153
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> ---
> tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
> tcp.h | 5 +++++
> 2 files changed, 52 insertions(+), 8 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 2ec4b0c..9385132 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -179,9 +179,11 @@
> *
> * Timeouts are implemented by means of timerfd timers, set based on flags:
> *
> - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> - * connection
> + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> + * SYN. It's the starting timeout for the first SYN retry. If this persists
"If this persists" makes sense for the existing ACK_TIMEOUT
description but not here, because it looks like it refers to "starting
timeout".
Coupled with the next patch, it becomes increasingly difficult to
understand what "this" persisting thing is.
Maybe directly say "Retry for ..., then reset the connection"? It's
shorter and clearer.
> + * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> + * tcp_syn_linear_timeouts) times in a row, reset the connection
> *
> * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> @@ -340,7 +342,7 @@ enum {
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> #define ACK_INTERVAL 10 /* ms */
> -#define SYN_TIMEOUT 10 /* s */
> +#define SYN_TIMEOUT_INIT 1 /* s */
Maybe mention RFC 6928 as done above? That's where this value comes
from.
I just noticed you do that in 4/4, so it's slightly nicer if you do
that right away here for ease of future reference, but not really
needed.
> #define ACK_TIMEOUT 2
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
> @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
>
> #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
>
> +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
This uses 121 columns. I'm not sure where all those tabs and \ come
from.
> +
> /* "Extended" data (not stored in the flow table) for TCP flow migration */
> static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
>
> @@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> if (conn->flags & ACK_TO_TAP_DUE) {
> it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> - if (!(conn->events & ESTABLISHED))
> - it.it_value.tv_sec = SYN_TIMEOUT;
> + if (!(conn->events & ESTABLISHED)) {
> + if (conn->retries < c->tcp.syn_linear_timeouts)
> + it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> + else
> + it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> + (conn->retries - c->tcp.syn_linear_timeouts);
Probably more readable, but I haven't tried: always start from
SYN_TIMEOUT_INIT, then multiply/shift if
conn->retries >= c->tcp.syn_linear_timeouts.
> + }
> else
> it.it_value.tv_sec = ACK_TIMEOUT;
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> tcp_timer_ctl(c, conn);
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> if (!(conn->events & ESTABLISHED)) {
> - flow_dbg(conn, "handshake timeout");
> - tcp_rst(c, conn);
> + if (conn->retries >= TCP_MAX_RETRIES ||
> + conn->retries >= (c->tcp.tcp_syn_retries +
> + c->tcp.syn_linear_timeouts)) {
> + flow_dbg(conn, "handshake timeout");
> + tcp_rst(c, conn);
> + } else {
> + flow_trace(conn, "SYN timeout, retry");
> + tcp_send_flag(c, conn, SYN);
> + conn->retries++;
I think I already raised this point on a previous revision: this needs
to be zeroed as the connection is established, but I don't see that in
the current version.
> + tcp_timer_ctl(c, conn);
> + }
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> flow_dbg(conn, "FIN timeout");
> tcp_rst(c, conn);
> @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
> return sl;
> }
>
> +/**
> + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
They're not initial, they'll be used for all the connections if I
understand correctly.
Maybe "Get SYN retries sysctl values"? I think the _init() in the
function name is also somewhat misleading.
> + * @c: Execution context
> +*/
> +void tcp_syn_params_init(struct ctx *c)
> +{
> + intmax_t tcp_syn_retries, syn_linear_timeouts;
> +
> + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
Why 8? Perhaps a #define would help?
> + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
> +
> + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> +
> + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
Similar to the comment above: these are not parameters of SYN segments
(which would seem to imply TCP options, such as the MSS).
We typically don't print C assignments, rather human-readable messages,
so that could be "Read sysctl values tcp_syn_retries: ...,
syn_linear_timeouts: ...".
> + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> +}
> +
> /**
> * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> * @c: Execution context
> @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
> {
> ASSERT(!c->no_tcp);
>
> + tcp_syn_params_init(c);
> +
> tcp_sock_iov_init(c);
>
> memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> diff --git a/tcp.h b/tcp.h
> index 234a803..4369b52 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
> * @fwd_out: Port forwarding configuration for outbound packets
> * @timer_run: Timestamp of most recent timer run
> * @pipe_size: Size of pipes for spliced connections
> + * @tcp_syn_retries: Number of SYN retries during handshake
> + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
> + * before switching to exponential backoff timeout
Maybe more compact:
* @syn_linear_timeouts: SYN retries before using exponential timeout
> */
> struct tcp_ctx {
> struct fwd_ports fwd_in;
> struct fwd_ports fwd_out;
> struct timespec timer_run;
> size_t pipe_size;
> + uint8_t tcp_syn_retries;
> + uint8_t syn_linear_timeouts;
> };
>
> #endif /* TCP_H */
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-17 6:28 ` [PATCH v6 4/4] tcp: Update data retransmission timeout Yumei Huang
2025-10-22 1:19 ` David Gibson
@ 2025-10-23 23:04 ` Stefano Brivio
2025-10-28 8:09 ` Yumei Huang
1 sibling, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-23 23:04 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Fri, 17 Oct 2025 14:28:38 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> Use an exponential backoff timeout for data retransmission according
> to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> in Appendix A of RFC 6298.
>
> Also combine the macros defining the initial RTO for both SYN and ACK.
>
> Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 27 ++++++++++++---------------
> 1 file changed, 12 insertions(+), 15 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 9385132..dc0ec6c 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -179,16 +179,14 @@
> *
> * Timeouts are implemented by means of timerfd timers, set based on flags:
> *
> - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> - * SYN. It's the starting timeout for the first SYN retry. If this persists
> - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> - * tcp_syn_linear_timeouts) times in a row, reset the connection
> - *
> - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> - * socket and reset sequence to what was acknowledged. If this persists for
> - * more than TCP_MAX_RETRIES times in a row, reset the connection
> + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> + * from the socket and reset sequence to what was acknowledged. This is the
> + * timeout for the first retry, in seconds. If this persists too many times
> + * in a row, reset the connection: TCP_MAX_RETRIES for established
> + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> + * handshake.
> *
> * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> @@ -342,8 +340,7 @@ enum {
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> #define ACK_INTERVAL 10 /* ms */
> -#define SYN_TIMEOUT_INIT 1 /* s */
> -#define ACK_TIMEOUT 2
> +#define RTO_INIT 1 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> #define ACT_TIMEOUT 7200
>
> @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> if (!(conn->events & ESTABLISHED)) {
> if (conn->retries < c->tcp.syn_linear_timeouts)
> - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> + it.it_value.tv_sec = RTO_INIT;
> else
> - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> + it.it_value.tv_sec = RTO_INIT <<
> (conn->retries - c->tcp.syn_linear_timeouts);
> }
> else
> - it.it_value.tv_sec = ACK_TIMEOUT;
> + it.it_value.tv_sec = RTO_INIT << conn->retries;
Same as on 3/4, but here it's clearly more convenient: just assign
RTO_INIT, and multiply as needed in the if / else clauses.
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> it.it_value.tv_sec = FIN_TIMEOUT;
> } else {
The rest of the series looks good to me.
It might be slightly more practical to factor in directly the RTO
clamp, and I don't think it's complicated now that you have the helper
from 2/4, but it's not a strong preference from my side, as the series
makes sense in any case.
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-23 23:04 ` Stefano Brivio
@ 2025-10-24 3:16 ` David Gibson
2025-10-24 6:05 ` Yumei Huang
2025-10-28 7:11 ` Yumei Huang
0 siblings, 2 replies; 44+ messages in thread
From: David Gibson @ 2025-10-24 3:16 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yumei Huang, passt-dev
[-- Attachment #1: Type: text/plain, Size: 5719 bytes --]
On Fri, Oct 24, 2025 at 01:04:27AM +0200, Stefano Brivio wrote:
> Sorry for the delay, mostly nits but a couple of substantial comments:
>
> On Fri, 17 Oct 2025 14:28:36 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > ---
> > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > util.h | 8 ++++++
> > 2 files changed, 92 insertions(+)
> >
> > diff --git a/util.c b/util.c
> > index c492f90..5c8c4bc 100644
> > --- a/util.c
> > +++ b/util.c
> > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > return len == 0 ? 0 : -1;
> > }
> >
> > +/**
> > + * read_file() - Read contents of file into a buffer
> > + * @path: File to read
>
> I see this is the same as write_file(), so in some sense it's
> pre-existing, but @path isn't really a "file" in the sense that it's
> not a file descriptor as one might expect from the description alone.
>
> I'd rather say "Path to file" or "Path to file to read" or something
> like that. On the other hand, if you want to keep this consistent with
> write_file(), never mind. Not a strong preference from me.
That's a good idea, but it's not crucial to the aim of this series, so
I'd suggest doing it as a later patch.
> > + * @buf: Buffer to store file contents
> > + * @buf_size: Size of buffer
> > + *
> > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
>
> Similar comment here: this is partially symmetric to read_file, but
> it's yet another convention we are introducing, because of the -2
> special value.
>
> Other somewhat related functions in util.c return with a meaningful
> errno value set, this one doesn't.
>
> The majority of helpers in passt, though, return with a negative
> errno-like value, and truncation can be very well represented by
> returning -ENOBUFS, see snprintf_check(). I think that's preferable.
>
> Again, if the intention is to make this consistent to write_file(), it
> can be left as it is.
Similarly. I considered commenting earlier on the -2 or truncation -
we don't actually use this, and it's a bit ugly. On the other hand it
doesn't hurt anything, so again, I think it can wait.
> > +*/
> > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > +{
> > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > + size_t total_read = 0;
> > + ssize_t rc;
> > +
> > + if (fd < 0) {
> > + warn_perror("Could not open %s", path);
> > + return -1;
> > + }
> > +
> > + while (total_read < buf_size) {
> > + rc = read(fd, buf + total_read, buf_size - total_read);
> > +
> > + if (rc < 0) {
> > + warn_perror("Couldn't read from %s", path);
> > + close(fd);
> > + return -1;
> > + }
> > +
> > + if (rc == 0)
> > + break;
> > +
> > + total_read += rc;
>
> Coverity Scan (I can provide instructions separately if desired)
> reports one issue below, but I'll mention it here for clarity: you are
> adding 'rc', of type ssize_t, to total_read, of type size_t, and
> buf_size is also of type size_t, so you could overflow total_read by
> adding for example the maximum value for ssize_t twice, to it.
>
> We can't run into the (theoretical) issue fixed by d836d9e34586 ("util:
> Remove possible quadratic behaviour from write_remainder()") but the
> solution here might be similar.
>
> In general we should make sure that rc is less than whatever value we
> might sum to total_read to make it overflow at any point in time.
>
> I didn't really check this in detail, I can do that if needed, and
> perhaps David remembers more clearly what we did in a similar
> situation. It might also be a false positive, by the way.
I think there are two slightly overlapping issues here.
1) I'm not sure Coverity knows/trusts that read() will never return
more than its third argument. That's what stops total_read from
ever exceeding buf_size. I'd need to think a bit harder about how
to convince it that's the case.
2) buf_size is size_t, but we're returning ssize_t. If we passed a
buf_size greater than ssize_t can hold, it would make a mess (UB, I
think). I don't think there are any perfectly elegant solutions in
C, so I'd suggest:
ASSERT(buf_size <= SSIZE_MAX);
at the top of the function.
I'd try (2) first because it's a real (if unlikely to be triggered)
bug. Then we can see if Coverity still complains (Yumei, I can walk
you through how to install and run Coverity locally using Red Hat's
subscription).
[snip]
> > + }
> > +
> > + close(fd);
> > +
> > + if (total_read == buf_size) {
> > + warn("File %s truncated, buffer too small", path);
>
> The file wasn't truncated (on disk) as this comment might seem to
> indicate. I'd rather say "File contents exceed buffer size", or
> "Partial file read", something like that.
>
> While at it, you could print the size we read (it's %zu, see similar
> examples where we print size_t types).
>
> > + return -2;
>
> Safer to NULL-terminate also in this case, perhaps? A future caller
> might handle -2 (or equivalent) as a "partial" failure and use the
> buffer anyway, so not NULL-terminating it is rather subtle.
That's a good idea. Given the purpose of the function, I think a
caller _should_ ignore the buffer if it gets an error, but it's
worthwhile to limit the damage if a caller forgets to check. That
applies for other error cases too.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-23 23:04 ` Stefano Brivio
@ 2025-10-24 3:30 ` David Gibson
2025-10-24 8:37 ` Stefano Brivio
2025-10-28 7:43 ` Yumei Huang
1 sibling, 1 reply; 44+ messages in thread
From: David Gibson @ 2025-10-24 3:30 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yumei Huang, passt-dev
[-- Attachment #1: Type: text/plain, Size: 6440 bytes --]
On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
> On Fri, 17 Oct 2025 14:28:37 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > If a client connects while guest is not connected or ready yet,
> > resend SYN instead of just resetting connection after 10 seconds.
> >
> > Use the same backoff calculation for the timeout as linux kernel.
>
> Linux.
>
> >
> > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > ---
> > tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
> > tcp.h | 5 +++++
> > 2 files changed, 52 insertions(+), 8 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 2ec4b0c..9385132 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -179,9 +179,11 @@
> > *
> > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > *
> > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > - * connection
> > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > + * SYN. It's the starting timeout for the first SYN retry. If this persists
>
> "If this persists" makes sense for the existing ACK_TIMEOUT
> description but not here, because it looks like it refers to "starting
> timeout".
>
> Coupled with the next patch, it becomes increasingly difficult to
> understand what "this" persisting thing is.
Yeah. This was my suggested wording, based on the existing wording
for ACK_TIMEOUT. It's not great, but I struggled a bit to find better
wording.
> Maybe directly say "Retry for ..., then reset the connection"? It's
> shorter and clearer.
There it is :). "Retry (NNN) times, then reset the connection".
[snip]
> > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > tcp_timer_ctl(c, conn);
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > if (!(conn->events & ESTABLISHED)) {
> > - flow_dbg(conn, "handshake timeout");
> > - tcp_rst(c, conn);
> > + if (conn->retries >= TCP_MAX_RETRIES ||
> > + conn->retries >= (c->tcp.tcp_syn_retries +
> > + c->tcp.syn_linear_timeouts)) {
> > + flow_dbg(conn, "handshake timeout");
> > + tcp_rst(c, conn);
> > + } else {
> > + flow_trace(conn, "SYN timeout, retry");
> > + tcp_send_flag(c, conn, SYN);
> > + conn->retries++;
>
> I think I already raised this point on a previous revision: this needs
> to be zeroed as the connection is established, but I don't see that in
> the current version.
Yes, you raised that, but then I realised it's already handled. I
think I put that in the thread, not just direct to Yumei, but maybe
not? Or it just got lost in the minutiae.
When we receive a SYN-ACK, it will have th->ack_seq advanced a byte
acknowledging the SYN. tcp_tap_handler() calls
tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see
the new ack_seq and clear retries (retrans before this series).
> > + tcp_timer_ctl(c, conn);
> > + }
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > flow_dbg(conn, "FIN timeout");
> > tcp_rst(c, conn);
> > @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > return sl;
> > }
> >
> > +/**
> > + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
>
> They're not initial, they'll be used for all the connections if I
> understand correctly.
>
> Maybe "Get SYN retries sysctl values"? I think the _init() in the
> function name is also somewhat misleading.
"Get host kernel RTO parameters"? Since we're thinking of extending
this to cover the RTO upper bound as well as the SYN specific
parameters.
>
> > + * @c: Execution context
> > +*/
> > +void tcp_syn_params_init(struct ctx *c)
> > +{
> > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > +
> > + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
>
> Why 8? Perhaps a #define would help?
>
> > + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
> > +
> > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > +
> > + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
>
> Similar to the comment above: these are not parameters of SYN segments
> (which would seem to imply TCP options, such as the MSS).
>
> We typically don't print C assignments, rather human-readable messages,
> so that could be "Read sysctl values tcp_syn_retries: ...,
> syn_linear_timeouts: ...".
>
>
> > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > +}
> > +
> > /**
> > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > * @c: Execution context
> > @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
> > {
> > ASSERT(!c->no_tcp);
> >
> > + tcp_syn_params_init(c);
> > +
> > tcp_sock_iov_init(c);
> >
> > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > diff --git a/tcp.h b/tcp.h
> > index 234a803..4369b52 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
> > * @fwd_out: Port forwarding configuration for outbound packets
> > * @timer_run: Timestamp of most recent timer run
> > * @pipe_size: Size of pipes for spliced connections
> > + * @tcp_syn_retries: Number of SYN retries during handshake
> > + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
> > + * before switching to exponential backoff timeout
>
> Maybe more compact:
>
> * @syn_linear_timeouts: SYN retries before using exponential timeout
>
> > */
> > struct tcp_ctx {
> > struct fwd_ports fwd_in;
> > struct fwd_ports fwd_out;
> > struct timespec timer_run;
> > size_t pipe_size;
> > + uint8_t tcp_syn_retries;
> > + uint8_t syn_linear_timeouts;
> > };
> >
> > #endif /* TCP_H */
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-24 3:16 ` David Gibson
@ 2025-10-24 6:05 ` Yumei Huang
2025-10-28 7:11 ` Yumei Huang
1 sibling, 0 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-24 6:05 UTC (permalink / raw)
To: David Gibson; +Cc: Stefano Brivio, passt-dev
On Fri, Oct 24, 2025 at 11:30 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Fri, Oct 24, 2025 at 01:04:27AM +0200, Stefano Brivio wrote:
> > Sorry for the delay, mostly nits but a couple of substantial comments:
> >
> > On Fri, 17 Oct 2025 14:28:36 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > ---
> > > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > util.h | 8 ++++++
> > > 2 files changed, 92 insertions(+)
> > >
> > > diff --git a/util.c b/util.c
> > > index c492f90..5c8c4bc 100644
> > > --- a/util.c
> > > +++ b/util.c
> > > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > > return len == 0 ? 0 : -1;
> > > }
> > >
> > > +/**
> > > + * read_file() - Read contents of file into a buffer
> > > + * @path: File to read
> >
> > I see this is the same as write_file(), so in some sense it's
> > pre-existing, but @path isn't really a "file" in the sense that it's
> > not a file descriptor as one might expect from the description alone.
> >
> > I'd rather say "Path to file" or "Path to file to read" or something
> > like that. On the other hand, if you want to keep this consistent with
> > write_file(), never mind. Not a strong preference from me.
>
> That's a good idea, but it's not crucial to the aim of this series, so
> I'd suggest doing it as a later patch.
>
> > > + * @buf: Buffer to store file contents
> > > + * @buf_size: Size of buffer
> > > + *
> > > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> >
> > Similar comment here: this is partially symmetric to read_file, but
> > it's yet another convention we are introducing, because of the -2
> > special value.
> >
> > Other somewhat related functions in util.c return with a meaningful
> > errno value set, this one doesn't.
> >
> > The majority of helpers in passt, though, return with a negative
> > errno-like value, and truncation can be very well represented by
> > returning -ENOBUFS, see snprintf_check(). I think that's preferable.
> >
> > Again, if the intention is to make this consistent to write_file(), it
> > can be left as it is.
>
> Similarly. I considered commenting earlier on the -2 or truncation -
> we don't actually use this, and it's a bit ugly. On the other hand it
> doesn't hurt anything, so again, I think it can wait.
>
> > > +*/
> > > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > > +{
> > > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > > + size_t total_read = 0;
> > > + ssize_t rc;
> > > +
> > > + if (fd < 0) {
> > > + warn_perror("Could not open %s", path);
> > > + return -1;
> > > + }
> > > +
> > > + while (total_read < buf_size) {
> > > + rc = read(fd, buf + total_read, buf_size - total_read);
> > > +
> > > + if (rc < 0) {
> > > + warn_perror("Couldn't read from %s", path);
> > > + close(fd);
> > > + return -1;
> > > + }
> > > +
> > > + if (rc == 0)
> > > + break;
> > > +
> > > + total_read += rc;
> >
> > Coverity Scan (I can provide instructions separately if desired)
> > reports one issue below, but I'll mention it here for clarity: you are
> > adding 'rc', of type ssize_t, to total_read, of type size_t, and
> > buf_size is also of type size_t, so you could overflow total_read by
> > adding for example the maximum value for ssize_t twice, to it.
> >
> > We can't run into the (theoretical) issue fixed by d836d9e34586 ("util:
> > Remove possible quadratic behaviour from write_remainder()") but the
> > solution here might be similar.
> >
> > In general we should make sure that rc is less than whatever value we
> > might sum to total_read to make it overflow at any point in time.
> >
> > I didn't really check this in detail, I can do that if needed, and
> > perhaps David remembers more clearly what we did in a similar
> > situation. It might also be a false positive, by the way.
>
> I think there are two slightly overlapping issues here.
>
> 1) I'm not sure Coverity knows/trusts that read() will never return
> more than its third argument. That's what stops total_read from
> ever exceeding buf_size. I'd need to think a bit harder about how
> to convince it that's the case.
>
> 2) buf_size is size_t, but we're returning ssize_t. If we passed a
> buf_size greater than ssize_t can hold, it would make a mess (UB, I
> think). I don't think there are any perfectly elegant solutions in
> C, so I'd suggest:
> ASSERT(buf_size <= SSIZE_MAX);
>
> at the top of the function.
>
> I'd try (2) first because it's a real (if unlikely to be triggered)
> Yumei, I can walk
> you through how to install and run Coverity locally using Red Hat's
> subscription.
That would be great!
>
> [snip]
> > > + }
> > > +
> > > + close(fd);
> > > +
> > > + if (total_read == buf_size) {
> > > + warn("File %s truncated, buffer too small", path);
> >
> > The file wasn't truncated (on disk) as this comment might seem to
> > indicate. I'd rather say "File contents exceed buffer size", or
> > "Partial file read", something like that.
> >
> > While at it, you could print the size we read (it's %zu, see similar
> > examples where we print size_t types).
> >
> > > + return -2;
> >
> > Safer to NULL-terminate also in this case, perhaps? A future caller
> > might handle -2 (or equivalent) as a "partial" failure and use the
> > buffer anyway, so not NULL-terminating it is rather subtle.
>
> That's a good idea. Given the purpose of the function, I think a
> caller _should_ ignore the buffer if it gets an error, but it's
> worthwhile to limit the damage if a caller forgets to check. That
> applies for other error cases too.
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-24 3:30 ` David Gibson
@ 2025-10-24 8:37 ` Stefano Brivio
2025-10-24 10:55 ` David Gibson
0 siblings, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-24 8:37 UTC (permalink / raw)
To: David Gibson; +Cc: Yumei Huang, passt-dev
On Fri, 24 Oct 2025 14:30:09 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
> > On Fri, 17 Oct 2025 14:28:37 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > If a client connects while guest is not connected or ready yet,
> > > resend SYN instead of just resetting connection after 10 seconds.
> > >
> > > Use the same backoff calculation for the timeout as linux kernel.
> >
> > Linux.
> >
> > >
> > > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > ---
> > > tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
> > > tcp.h | 5 +++++
> > > 2 files changed, 52 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 2ec4b0c..9385132 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -179,9 +179,11 @@
> > > *
> > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > *
> > > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > > - * connection
> > > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > + * SYN. It's the starting timeout for the first SYN retry. If this persists
> >
> > "If this persists" makes sense for the existing ACK_TIMEOUT
> > description but not here, because it looks like it refers to "starting
> > timeout".
> >
> > Coupled with the next patch, it becomes increasingly difficult to
> > understand what "this" persisting thing is.
>
> Yeah. This was my suggested wording, based on the existing wording
> for ACK_TIMEOUT. It's not great, but I struggled a bit to find better
> wording.
>
> > Maybe directly say "Retry for ..., then reset the connection"? It's
> > shorter and clearer.
>
> There it is :). "Retry (NNN) times, then reset the connection".
>
> [snip]
> > > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > > tcp_timer_ctl(c, conn);
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > if (!(conn->events & ESTABLISHED)) {
> > > - flow_dbg(conn, "handshake timeout");
> > > - tcp_rst(c, conn);
> > > + if (conn->retries >= TCP_MAX_RETRIES ||
> > > + conn->retries >= (c->tcp.tcp_syn_retries +
> > > + c->tcp.syn_linear_timeouts)) {
> > > + flow_dbg(conn, "handshake timeout");
> > > + tcp_rst(c, conn);
> > > + } else {
> > > + flow_trace(conn, "SYN timeout, retry");
> > > + tcp_send_flag(c, conn, SYN);
> > > + conn->retries++;
> >
> > I think I already raised this point on a previous revision: this needs
> > to be zeroed as the connection is established, but I don't see that in
> > the current version.
>
> Yes, you raised that, but then I realised it's already handled. I
> think I put that in the thread, not just direct to Yumei, but maybe
> not? Or it just got lost in the minutiae.
Yes, here:
https://archives.passt.top/passt-dev/aOxFRfJjPWy0ZW0M@zatzit
this is another example of what I meant about (potential) advantages of
a fully threaded (email) workflow.
In this case, I didn't review v2, which came before you could post this
to my comment on v1, but in a normal case, we could have settled this
earlier, once for all.
> When we receive a SYN-ACK, it will have th->ack_seq advanced a byte
> acknowledging the SYN. tcp_tap_handler() calls
> tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see
> the new ack_seq and clear retries (retrans before this series).
It doesn't look obvious at all to me.
We're unlikely to break it in the future, so I don't think it's fragile
in the long term, but... can one of you double check that it's actually
the case with a manual one-off test?
> > > + tcp_timer_ctl(c, conn);
> > > + }
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > flow_dbg(conn, "FIN timeout");
> > > tcp_rst(c, conn);
> > > @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > > return sl;
> > > }
> > >
> > > +/**
> > > + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
> >
> > They're not initial, they'll be used for all the connections if I
> > understand correctly.
> >
> > Maybe "Get SYN retries sysctl values"? I think the _init() in the
> > function name is also somewhat misleading.
>
> "Get host kernel RTO parameters"? Since we're thinking of extending
> this to cover the RTO upper bound as well as the SYN specific
> parameters.
Ah, maybe yes, better.
> > > + * @c: Execution context
> > > +*/
> > > +void tcp_syn_params_init(struct ctx *c)
> > > +{
> > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > +
> > > + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
> >
> > Why 8? Perhaps a #define would help?
> >
> > > + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
> > > +
> > > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > > +
> > > + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
> >
> > Similar to the comment above: these are not parameters of SYN segments
> > (which would seem to imply TCP options, such as the MSS).
> >
> > We typically don't print C assignments, rather human-readable messages,
> > so that could be "Read sysctl values tcp_syn_retries: ...,
> > syn_linear_timeouts: ...".
> >
> >
> > > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > > +}
> > > +
> > > /**
> > > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > > * @c: Execution context
> > > @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
> > > {
> > > ASSERT(!c->no_tcp);
> > >
> > > + tcp_syn_params_init(c);
> > > +
> > > tcp_sock_iov_init(c);
> > >
> > > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > > diff --git a/tcp.h b/tcp.h
> > > index 234a803..4369b52 100644
> > > --- a/tcp.h
> > > +++ b/tcp.h
> > > @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
> > > * @fwd_out: Port forwarding configuration for outbound packets
> > > * @timer_run: Timestamp of most recent timer run
> > > * @pipe_size: Size of pipes for spliced connections
> > > + * @tcp_syn_retries: Number of SYN retries during handshake
> > > + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
> > > + * before switching to exponential backoff timeout
> >
> > Maybe more compact:
> >
> > * @syn_linear_timeouts: SYN retries before using exponential timeout
> >
> > > */
> > > struct tcp_ctx {
> > > struct fwd_ports fwd_in;
> > > struct fwd_ports fwd_out;
> > > struct timespec timer_run;
> > > size_t pipe_size;
> > > + uint8_t tcp_syn_retries;
> > > + uint8_t syn_linear_timeouts;
> > > };
> > >
> > > #endif /* TCP_H */
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-24 8:37 ` Stefano Brivio
@ 2025-10-24 10:55 ` David Gibson
2025-10-27 3:37 ` Yumei Huang
0 siblings, 1 reply; 44+ messages in thread
From: David Gibson @ 2025-10-24 10:55 UTC (permalink / raw)
To: Stefano Brivio; +Cc: Yumei Huang, passt-dev
[-- Attachment #1: Type: text/plain, Size: 3342 bytes --]
On Fri, Oct 24, 2025 at 10:37:17AM +0200, Stefano Brivio wrote:
> On Fri, 24 Oct 2025 14:30:09 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> > On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
> > > On Fri, 17 Oct 2025 14:28:37 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
[snip]
> > > > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > > > tcp_timer_ctl(c, conn);
> > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > if (!(conn->events & ESTABLISHED)) {
> > > > - flow_dbg(conn, "handshake timeout");
> > > > - tcp_rst(c, conn);
> > > > + if (conn->retries >= TCP_MAX_RETRIES ||
> > > > + conn->retries >= (c->tcp.tcp_syn_retries +
> > > > + c->tcp.syn_linear_timeouts)) {
> > > > + flow_dbg(conn, "handshake timeout");
> > > > + tcp_rst(c, conn);
> > > > + } else {
> > > > + flow_trace(conn, "SYN timeout, retry");
> > > > + tcp_send_flag(c, conn, SYN);
> > > > + conn->retries++;
> > >
> > > I think I already raised this point on a previous revision: this needs
> > > to be zeroed as the connection is established, but I don't see that in
> > > the current version.
> >
> > Yes, you raised that, but then I realised it's already handled. I
> > think I put that in the thread, not just direct to Yumei, but maybe
> > not? Or it just got lost in the minutiae.
>
> Yes, here:
>
> https://archives.passt.top/passt-dev/aOxFRfJjPWy0ZW0M@zatzit
>
> this is another example of what I meant about (potential) advantages of
> a fully threaded (email) workflow.
>
> In this case, I didn't review v2, which came before you could post this
> to my comment on v1, but in a normal case, we could have settled this
> earlier, once for all.
Ah, right, that'd do it.
> > When we receive a SYN-ACK, it will have th->ack_seq advanced a byte
> > acknowledging the SYN. tcp_tap_handler() calls
> > tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see
> > the new ack_seq and clear retries (retrans before this series).
>
> It doesn't look obvious at all to me.
Oh, it's definitely not obvious, but I'm pretty confident it's
correct. Fwiw, I spotted this because I thought the explicit handling
in v2 wasn't at quite the right point logically (though close enough
to be fine in practice). I went looking for the precise right point -
when we receive the SYN-ACK - and there it was, already handled.
It does make a kind of logical sense. The RFCs don't generally treat
SYN (or SYN-ACK, or FIN) retransmits any differently from data
retransmits. We do treat them differently, but less so after this
series, which is a good thing, I think.
> We're unlikely to break it in the future, so I don't think it's fragile
> in the long term, but... can one of you double check that it's actually
> the case with a manual one-off test?
Yeah, I guess that's wise. Easiest way is probably to add a temporary
debug message here, and try it against a qemu guest that's temporarily
suspended. Yumei, I can walk you through this, too.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-24 10:55 ` David Gibson
@ 2025-10-27 3:37 ` Yumei Huang
2025-10-27 6:49 ` Stefano Brivio
0 siblings, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-27 3:37 UTC (permalink / raw)
To: David Gibson; +Cc: Stefano Brivio, passt-dev
On Fri, Oct 24, 2025 at 6:56 PM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Fri, Oct 24, 2025 at 10:37:17AM +0200, Stefano Brivio wrote:
> > On Fri, 24 Oct 2025 14:30:09 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
> > > > On Fri, 17 Oct 2025 14:28:37 +0800
> > > > Yumei Huang <yuhuang@redhat.com> wrote:
> [snip]
> > > > > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > > > > tcp_timer_ctl(c, conn);
> > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > - flow_dbg(conn, "handshake timeout");
> > > > > - tcp_rst(c, conn);
> > > > > + if (conn->retries >= TCP_MAX_RETRIES ||
> > > > > + conn->retries >= (c->tcp.tcp_syn_retries +
> > > > > + c->tcp.syn_linear_timeouts)) {
> > > > > + flow_dbg(conn, "handshake timeout");
> > > > > + tcp_rst(c, conn);
> > > > > + } else {
> > > > > + flow_trace(conn, "SYN timeout, retry");
> > > > > + tcp_send_flag(c, conn, SYN);
> > > > > + conn->retries++;
> > > >
> > > > I think I already raised this point on a previous revision: this needs
> > > > to be zeroed as the connection is established, but I don't see that in
> > > > the current version.
> > >
> > > Yes, you raised that, but then I realised it's already handled. I
> > > think I put that in the thread, not just direct to Yumei, but maybe
> > > not? Or it just got lost in the minutiae.
> >
> > Yes, here:
> >
> > https://archives.passt.top/passt-dev/aOxFRfJjPWy0ZW0M@zatzit
> >
> > this is another example of what I meant about (potential) advantages of
> > a fully threaded (email) workflow.
> >
> > In this case, I didn't review v2, which came before you could post this
> > to my comment on v1, but in a normal case, we could have settled this
> > earlier, once for all.
>
> Ah, right, that'd do it.
>
> > > When we receive a SYN-ACK, it will have th->ack_seq advanced a byte
> > > acknowledging the SYN. tcp_tap_handler() calls
> > > tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see
> > > the new ack_seq and clear retries (retrans before this series).
> >
> > It doesn't look obvious at all to me.
>
> Oh, it's definitely not obvious, but I'm pretty confident it's
> correct. Fwiw, I spotted this because I thought the explicit handling
> in v2 wasn't at quite the right point logically (though close enough
> to be fine in practice). I went looking for the precise right point -
> when we receive the SYN-ACK - and there it was, already handled.
>
> It does make a kind of logical sense. The RFCs don't generally treat
> SYN (or SYN-ACK, or FIN) retransmits any differently from data
> retransmits. We do treat them differently, but less so after this
> series, which is a good thing, I think.
>
> > We're unlikely to break it in the future, so I don't think it's fragile
> > in the long term, but... can one of you double check that it's actually
> > the case with a manual one-off test?
>
> Yeah, I guess that's wise. Easiest way is probably to add a temporary
> debug message here, and try it against a qemu guest that's temporarily
> suspended. Yumei, I can walk you through this, too.
Thank you, I've verified this by adding debug messages in the if block:
if (th->ack && !(conn->events & ESTABLISHED))
tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
conn->retries gets reset in tcp_update_seqack_from_tap().
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-27 3:37 ` Yumei Huang
@ 2025-10-27 6:49 ` Stefano Brivio
0 siblings, 0 replies; 44+ messages in thread
From: Stefano Brivio @ 2025-10-27 6:49 UTC (permalink / raw)
To: Yumei Huang; +Cc: David Gibson, passt-dev
On Mon, 27 Oct 2025 11:37:20 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Fri, Oct 24, 2025 at 6:56 PM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > On Fri, Oct 24, 2025 at 10:37:17AM +0200, Stefano Brivio wrote:
> > > On Fri, 24 Oct 2025 14:30:09 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
> > > > > On Fri, 17 Oct 2025 14:28:37 +0800
> > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > [snip]
> > > > > > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > > > > > tcp_timer_ctl(c, conn);
> > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > - flow_dbg(conn, "handshake timeout");
> > > > > > - tcp_rst(c, conn);
> > > > > > + if (conn->retries >= TCP_MAX_RETRIES ||
> > > > > > + conn->retries >= (c->tcp.tcp_syn_retries +
> > > > > > + c->tcp.syn_linear_timeouts)) {
> > > > > > + flow_dbg(conn, "handshake timeout");
> > > > > > + tcp_rst(c, conn);
> > > > > > + } else {
> > > > > > + flow_trace(conn, "SYN timeout, retry");
> > > > > > + tcp_send_flag(c, conn, SYN);
> > > > > > + conn->retries++;
> > > > >
> > > > > I think I already raised this point on a previous revision: this needs
> > > > > to be zeroed as the connection is established, but I don't see that in
> > > > > the current version.
> > > >
> > > > Yes, you raised that, but then I realised it's already handled. I
> > > > think I put that in the thread, not just direct to Yumei, but maybe
> > > > not? Or it just got lost in the minutiae.
> > >
> > > Yes, here:
> > >
> > > https://archives.passt.top/passt-dev/aOxFRfJjPWy0ZW0M@zatzit
> > >
> > > this is another example of what I meant about (potential) advantages of
> > > a fully threaded (email) workflow.
> > >
> > > In this case, I didn't review v2, which came before you could post this
> > > to my comment on v1, but in a normal case, we could have settled this
> > > earlier, once for all.
> >
> > Ah, right, that'd do it.
> >
> > > > When we receive a SYN-ACK, it will have th->ack_seq advanced a byte
> > > > acknowledging the SYN. tcp_tap_handler() calls
> > > > tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see
> > > > the new ack_seq and clear retries (retrans before this series).
> > >
> > > It doesn't look obvious at all to me.
> >
> > Oh, it's definitely not obvious, but I'm pretty confident it's
> > correct. Fwiw, I spotted this because I thought the explicit handling
> > in v2 wasn't at quite the right point logically (though close enough
> > to be fine in practice). I went looking for the precise right point -
> > when we receive the SYN-ACK - and there it was, already handled.
> >
> > It does make a kind of logical sense. The RFCs don't generally treat
> > SYN (or SYN-ACK, or FIN) retransmits any differently from data
> > retransmits. We do treat them differently, but less so after this
> > series, which is a good thing, I think.
> >
> > > We're unlikely to break it in the future, so I don't think it's fragile
> > > in the long term, but... can one of you double check that it's actually
> > > the case with a manual one-off test?
> >
> > Yeah, I guess that's wise. Easiest way is probably to add a temporary
> > debug message here, and try it against a qemu guest that's temporarily
> > suspended. Yumei, I can walk you through this, too.
>
> Thank you, I've verified this by adding debug messages in the if block:
>
> if (th->ack && !(conn->events & ESTABLISHED))
> tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
>
> conn->retries gets reset in tcp_update_seqack_from_tap().
Great, thanks for checking.
Maybe we could mention this in a comment, but I'm not exactly sure
where we would add this comment, so unless somebody has a good idea we
can also skip that.
I don't have further comments on v6 by the way, the rest of the series
looks good to me.
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-24 3:16 ` David Gibson
2025-10-24 6:05 ` Yumei Huang
@ 2025-10-28 7:11 ` Yumei Huang
2025-10-28 11:43 ` Stefano Brivio
1 sibling, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-28 7:11 UTC (permalink / raw)
To: David Gibson; +Cc: Stefano Brivio, passt-dev
On Fri, Oct 24, 2025 at 11:30 AM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Fri, Oct 24, 2025 at 01:04:27AM +0200, Stefano Brivio wrote:
> > Sorry for the delay, mostly nits but a couple of substantial comments:
> >
> > On Fri, 17 Oct 2025 14:28:36 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > ---
> > > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > util.h | 8 ++++++
> > > 2 files changed, 92 insertions(+)
> > >
> > > diff --git a/util.c b/util.c
> > > index c492f90..5c8c4bc 100644
> > > --- a/util.c
> > > +++ b/util.c
> > > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > > return len == 0 ? 0 : -1;
> > > }
> > >
> > > +/**
> > > + * read_file() - Read contents of file into a buffer
> > > + * @path: File to read
> >
> > I see this is the same as write_file(), so in some sense it's
> > pre-existing, but @path isn't really a "file" in the sense that it's
> > not a file descriptor as one might expect from the description alone.
> >
> > I'd rather say "Path to file" or "Path to file to read" or something
> > like that. On the other hand, if you want to keep this consistent with
> > write_file(), never mind. Not a strong preference from me.
>
> That's a good idea, but it's not crucial to the aim of this series, so
> I'd suggest doing it as a later patch.
Thank you. As I have to respin and this a minor change, I will update
that in v7.
>
> > > + * @buf: Buffer to store file contents
> > > + * @buf_size: Size of buffer
> > > + *
> > > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> >
> > Similar comment here: this is partially symmetric to read_file, but
> > it's yet another convention we are introducing, because of the -2
> > special value.
> >
> > Other somewhat related functions in util.c return with a meaningful
> > errno value set, this one doesn't.
> >
> > The majority of helpers in passt, though, return with a negative
> > errno-like value, and truncation can be very well represented by
> > returning -ENOBUFS, see snprintf_check(). I think that's preferable.
> >
> > Again, if the intention is to make this consistent to write_file(), it
> > can be left as it is.
>
> Similarly. I considered commenting earlier on the -2 or truncation -
> we don't actually use this, and it's a bit ugly. On the other hand it
> doesn't hurt anything, so again, I think it can wait.
Same here.
>
> > > +*/
> > > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > > +{
> > > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > > + size_t total_read = 0;
> > > + ssize_t rc;
> > > +
> > > + if (fd < 0) {
> > > + warn_perror("Could not open %s", path);
> > > + return -1;
> > > + }
> > > +
> > > + while (total_read < buf_size) {
> > > + rc = read(fd, buf + total_read, buf_size - total_read);
> > > +
> > > + if (rc < 0) {
> > > + warn_perror("Couldn't read from %s", path);
> > > + close(fd);
> > > + return -1;
> > > + }
> > > +
> > > + if (rc == 0)
> > > + break;
> > > +
> > > + total_read += rc;
> >
> > Coverity Scan (I can provide instructions separately if desired)
> > reports one issue below, but I'll mention it here for clarity: you are
> > adding 'rc', of type ssize_t, to total_read, of type size_t, and
> > buf_size is also of type size_t, so you could overflow total_read by
> > adding for example the maximum value for ssize_t twice, to it.
> >
> > We can't run into the (theoretical) issue fixed by d836d9e34586 ("util:
> > Remove possible quadratic behaviour from write_remainder()") but the
> > solution here might be similar.
> >
> > In general we should make sure that rc is less than whatever value we
> > might sum to total_read to make it overflow at any point in time.
> >
> > I didn't really check this in detail, I can do that if needed, and
> > perhaps David remembers more clearly what we did in a similar
> > situation. It might also be a false positive, by the way.
>
> I think there are two slightly overlapping issues here.
>
> 1) I'm not sure Coverity knows/trusts that read() will never return
> more than its third argument. That's what stops total_read from
> ever exceeding buf_size. I'd need to think a bit harder about how
> to convince it that's the case.
>
> 2) buf_size is size_t, but we're returning ssize_t. If we passed a
> buf_size greater than ssize_t can hold, it would make a mess (UB, I
> think). I don't think there are any perfectly elegant solutions in
> C, so I'd suggest:
> ASSERT(buf_size <= SSIZE_MAX);
>
> at the top of the function.
>
> I'd try (2) first because it's a real (if unlikely to be triggered)
> bug. Then we can see if Coverity still complains (Yumei, I can walk
> you through how to install and run Coverity locally using Red Hat's
> subscription).
Coverity doesn't complain about it in my setup. Stefano may give more
info on that. In a word, it's a false positive.
>
> [snip]
> > > + }
> > > +
> > > + close(fd);
> > > +
> > > + if (total_read == buf_size) {
> > > + warn("File %s truncated, buffer too small", path);
> >
> > The file wasn't truncated (on disk) as this comment might seem to
> > indicate. I'd rather say "File contents exceed buffer size", or
> > "Partial file read", something like that.
> >
> > While at it, you could print the size we read (it's %zu, see similar
> > examples where we print size_t types).
> >
> > > + return -2;
> >
> > Safer to NULL-terminate also in this case, perhaps? A future caller
> > might handle -2 (or equivalent) as a "partial" failure and use the
> > buffer anyway, so not NULL-terminating it is rather subtle.
>
> That's a good idea. Given the purpose of the function, I think a
> caller _should_ ignore the buffer if it gets an error, but it's
> worthwhile to limit the damage if a caller forgets to check. That
> applies for other error cases too.
The rest will update in v7 as well.
>
> --
> David Gibson (he or they) | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
> | around.
> http://www.ozlabs.org/~dgibson
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-23 23:04 ` Stefano Brivio
2025-10-24 3:30 ` David Gibson
@ 2025-10-28 7:43 ` Yumei Huang
2025-10-28 11:44 ` Stefano Brivio
1 sibling, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-28 7:43 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Fri, 17 Oct 2025 14:28:37 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > If a client connects while guest is not connected or ready yet,
> > resend SYN instead of just resetting connection after 10 seconds.
> >
> > Use the same backoff calculation for the timeout as linux kernel.
>
> Linux.
>
> >
> > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > ---
> > tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
> > tcp.h | 5 +++++
> > 2 files changed, 52 insertions(+), 8 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 2ec4b0c..9385132 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -179,9 +179,11 @@
> > *
> > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > *
> > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > - * connection
> > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > + * SYN. It's the starting timeout for the first SYN retry. If this persists
>
> "If this persists" makes sense for the existing ACK_TIMEOUT
> description but not here, because it looks like it refers to "starting
> timeout".
>
> Coupled with the next patch, it becomes increasingly difficult to
> understand what "this" persisting thing is.
>
> Maybe directly say "Retry for ..., then reset the connection"? It's
> shorter and clearer.
>
> > + * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > + * tcp_syn_linear_timeouts) times in a row, reset the connection
> > *
> > * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > @@ -340,7 +342,7 @@ enum {
> > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> >
> > #define ACK_INTERVAL 10 /* ms */
> > -#define SYN_TIMEOUT 10 /* s */
> > +#define SYN_TIMEOUT_INIT 1 /* s */
>
> Maybe mention RFC 6928 as done above? That's where this value comes
> from.
>
> I just noticed you do that in 4/4, so it's slightly nicer if you do
> that right away here for ease of future reference, but not really
> needed.
>
> > #define ACK_TIMEOUT 2
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> > @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> >
> > #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
> >
> > +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
>
> This uses 121 columns. I'm not sure where all those tabs and \ come
> from.
My bad. I renamed the macro and it fits in 80 columns but I forgot to
remove the '\' and the spaces before that.
>
> > +
> > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> >
> > @@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > - if (!(conn->events & ESTABLISHED))
> > - it.it_value.tv_sec = SYN_TIMEOUT;
> > + if (!(conn->events & ESTABLISHED)) {
> > + if (conn->retries < c->tcp.syn_linear_timeouts)
> > + it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > + else
> > + it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > + (conn->retries - c->tcp.syn_linear_timeouts);
>
> Probably more readable, but I haven't tried: always start from
> SYN_TIMEOUT_INIT, then multiply/shift if
> conn->retries >= c->tcp.syn_linear_timeouts.
I guess you meant:
if (!(conn->events & ESTABLISHED)) {
it.it_value.tv_sec = SYN_TIMEOUT_INIT;
if (conn->retries >= c->tcp.syn_linear_timeouts)
it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
(conn->retries - c->tcp.syn_linear_timeouts);
other than something like:
it.it_value.tv_sec <<= 1
in above second if block, since the latter would cause
it.it_value.tv_sec=2 always when retries>=syn_linear_timeouts, as
it.it_value.tv_sec is set to SYN_TIMEOUT_INIT before the if condition.
If I'm correct, I'm not sure if it's more readable to change to that.
I feel the if/else clause quite matches the way the kernel handles it.
What do you think?
>
> > + }
> > else
> > it.it_value.tv_sec = ACK_TIMEOUT;
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > tcp_timer_ctl(c, conn);
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > if (!(conn->events & ESTABLISHED)) {
> > - flow_dbg(conn, "handshake timeout");
> > - tcp_rst(c, conn);
> > + if (conn->retries >= TCP_MAX_RETRIES ||
> > + conn->retries >= (c->tcp.tcp_syn_retries +
> > + c->tcp.syn_linear_timeouts)) {
> > + flow_dbg(conn, "handshake timeout");
> > + tcp_rst(c, conn);
> > + } else {
> > + flow_trace(conn, "SYN timeout, retry");
> > + tcp_send_flag(c, conn, SYN);
> > + conn->retries++;
>
> I think I already raised this point on a previous revision: this needs
> to be zeroed as the connection is established, but I don't see that in
> the current version.
>
> > + tcp_timer_ctl(c, conn);
> > + }
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > flow_dbg(conn, "FIN timeout");
> > tcp_rst(c, conn);
> > @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > return sl;
> > }
> >
> > +/**
> > + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
>
> They're not initial, they'll be used for all the connections if I
> understand correctly.
>
> Maybe "Get SYN retries sysctl values"?
> I think the _init() in the
> function name is also somewhat misleading.
Do you have a better idea for the function name? I named it just like
other functions called in tcp_init(), such as tcp_sock_iov_init(),
tcp_splice_init().
>
> > + * @c: Execution context
> > +*/
> > +void tcp_syn_params_init(struct ctx *c)
> > +{
> > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > +
> > + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
>
> Why 8? Perhaps a #define would help?
The 8 actually comes from your suggestion in v1:
"
(optional, and might be in another series, but keeping it together
with the rest might be more convenient): read tcp_syn_retries, limit
to 8, and also read tcp_syn_linear_timeouts
"
I guess it can be replaced with 6 as it's the default value according to
Documentation/networking/ip-sysctl.rst.
We already have TCP_SYN_RETRIES for the sysctl file, maybe have
another TCP_SYN_RETRIES_DEFAULT for the number?
>
> > + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
The default value is 4 as Documentation/networking/ip-sysctl.rst. I
don't quite remember where the 1 is from. Should we update it as well
and add the similar macro as above?
> > +
> > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > +
> > + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
>
> Similar to the comment above: these are not parameters of SYN segments
> (which would seem to imply TCP options, such as the MSS).
>
> We typically don't print C assignments, rather human-readable messages,
> so that could be "Read sysctl values tcp_syn_retries: ...,
> syn_linear_timeouts: ...".
Got it.
>
>
> > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > +}
> > +
> > /**
> > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > * @c: Execution context
> > @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
> > {
> > ASSERT(!c->no_tcp);
> >
> > + tcp_syn_params_init(c);
> > +
> > tcp_sock_iov_init(c);
> >
> > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > diff --git a/tcp.h b/tcp.h
> > index 234a803..4369b52 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
> > * @fwd_out: Port forwarding configuration for outbound packets
> > * @timer_run: Timestamp of most recent timer run
> > * @pipe_size: Size of pipes for spliced connections
> > + * @tcp_syn_retries: Number of SYN retries during handshake
> > + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
> > + * before switching to exponential backoff timeout
>
> Maybe more compact:
>
> * @syn_linear_timeouts: SYN retries before using exponential timeout
So we remove the '"Number of" as well? I guess we should make it
consistent for both tcp_syn_retries and syn_linear_timeouts.
>
> > */
> > struct tcp_ctx {
> > struct fwd_ports fwd_in;
> > struct fwd_ports fwd_out;
> > struct timespec timer_run;
> > size_t pipe_size;
> > + uint8_t tcp_syn_retries;
> > + uint8_t syn_linear_timeouts;
> > };
> >
> > #endif /* TCP_H */
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-23 23:04 ` Stefano Brivio
@ 2025-10-28 8:09 ` Yumei Huang
2025-10-28 11:44 ` Stefano Brivio
0 siblings, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-28 8:09 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Fri, 17 Oct 2025 14:28:38 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > Use an exponential backoff timeout for data retransmission according
> > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > in Appendix A of RFC 6298.
> >
> > Also combine the macros defining the initial RTO for both SYN and ACK.
> >
> > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp.c | 27 ++++++++++++---------------
> > 1 file changed, 12 insertions(+), 15 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 9385132..dc0ec6c 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -179,16 +179,14 @@
> > *
> > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > *
> > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > - *
> > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > - * socket and reset sequence to what was acknowledged. If this persists for
> > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > + * from the socket and reset sequence to what was acknowledged. This is the
> > + * timeout for the first retry, in seconds. If this persists too many times
> > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > + * handshake.
> > *
> > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > @@ -342,8 +340,7 @@ enum {
> > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> >
> > #define ACK_INTERVAL 10 /* ms */
> > -#define SYN_TIMEOUT_INIT 1 /* s */
> > -#define ACK_TIMEOUT 2
> > +#define RTO_INIT 1 /* s, RFC 6298 */
> > #define FIN_TIMEOUT 60
> > #define ACT_TIMEOUT 7200
> >
> > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > if (!(conn->events & ESTABLISHED)) {
> > if (conn->retries < c->tcp.syn_linear_timeouts)
> > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > + it.it_value.tv_sec = RTO_INIT;
> > else
> > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > + it.it_value.tv_sec = RTO_INIT <<
> > (conn->retries - c->tcp.syn_linear_timeouts);
> > }
> > else
> > - it.it_value.tv_sec = ACK_TIMEOUT;
> > + it.it_value.tv_sec = RTO_INIT << conn->retries;
>
> Same as on 3/4, but here it's clearly more convenient: just assign
> RTO_INIT, and multiply as needed in the if / else clauses.
I guess we can't just assign RTO_INIT. Maybe assign it only when
retries==0, otherwise multiply as it.it_value.tv_sec <<=1. But it
seems more complicated. What do you think?
>
> > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > it.it_value.tv_sec = FIN_TIMEOUT;
> > } else {
>
> The rest of the series looks good to me.
>
> It might be slightly more practical to factor in directly the RTO
> clamp, and I don't think it's complicated now that you have the helper
> from 2/4, but it's not a strong preference from my side, as the series
> makes sense in any case.
Reading tcp_rto_max_ms can be easy with the helper. My concern is
about the way we get the total time for retries.
I used to do it like this in v2,
https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
+#define RETRY_ELAPSED(timeout_init, retries) \
+ ((timeout_init) * ((1 << ((retries) + 1)) - 2))
Though the formula is not quite right, we could refine it as below:
#define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
Does it make sense to get the time this way?
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function
2025-10-28 7:11 ` Yumei Huang
@ 2025-10-28 11:43 ` Stefano Brivio
0 siblings, 0 replies; 44+ messages in thread
From: Stefano Brivio @ 2025-10-28 11:43 UTC (permalink / raw)
To: Yumei Huang; +Cc: David Gibson, passt-dev
On Tue, 28 Oct 2025 15:11:32 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Fri, Oct 24, 2025 at 11:30 AM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > On Fri, Oct 24, 2025 at 01:04:27AM +0200, Stefano Brivio wrote:
> > > Sorry for the delay, mostly nits but a couple of substantial comments:
> > >
> > > On Fri, 17 Oct 2025 14:28:36 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > ---
> > > > util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > util.h | 8 ++++++
> > > > 2 files changed, 92 insertions(+)
> > > >
> > > > diff --git a/util.c b/util.c
> > > > index c492f90..5c8c4bc 100644
> > > > --- a/util.c
> > > > +++ b/util.c
> > > > @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf)
> > > > return len == 0 ? 0 : -1;
> > > > }
> > > >
> > > > +/**
> > > > + * read_file() - Read contents of file into a buffer
> > > > + * @path: File to read
> > >
> > > I see this is the same as write_file(), so in some sense it's
> > > pre-existing, but @path isn't really a "file" in the sense that it's
> > > not a file descriptor as one might expect from the description alone.
> > >
> > > I'd rather say "Path to file" or "Path to file to read" or something
> > > like that. On the other hand, if you want to keep this consistent with
> > > write_file(), never mind. Not a strong preference from me.
> >
> > That's a good idea, but it's not crucial to the aim of this series, so
> > I'd suggest doing it as a later patch.
>
> Thank you. As I have to respin and this a minor change, I will update
> that in v7.
Just for clarity: I think David meant that it *should* be done as a
separate patch (maybe in the same series, if you have a moment, but it's
not that critical).
I don't have a particular preference, but David's point makes sense to
me.
> > > > + * @buf: Buffer to store file contents
> > > > + * @buf_size: Size of buffer
> > > > + *
> > > > + * Return: number of bytes read on success, -1 on any error, -2 on truncation
> > >
> > > Similar comment here: this is partially symmetric to read_file, but
> > > it's yet another convention we are introducing, because of the -2
> > > special value.
> > >
> > > Other somewhat related functions in util.c return with a meaningful
> > > errno value set, this one doesn't.
> > >
> > > The majority of helpers in passt, though, return with a negative
> > > errno-like value, and truncation can be very well represented by
> > > returning -ENOBUFS, see snprintf_check(). I think that's preferable.
> > >
> > > Again, if the intention is to make this consistent to write_file(), it
> > > can be left as it is.
> >
> > Similarly. I considered commenting earlier on the -2 or truncation -
> > we don't actually use this, and it's a bit ugly. On the other hand it
> > doesn't hurt anything, so again, I think it can wait.
>
> Same here.
Same as above ("I think it can wait"). Well, up to you, really.
> > > > +*/
> > > > +ssize_t read_file(const char *path, char *buf, size_t buf_size)
> > > > +{
> > > > + int fd = open(path, O_RDONLY | O_CLOEXEC);
> > > > + size_t total_read = 0;
> > > > + ssize_t rc;
> > > > +
> > > > + if (fd < 0) {
> > > > + warn_perror("Could not open %s", path);
> > > > + return -1;
> > > > + }
> > > > +
> > > > + while (total_read < buf_size) {
> > > > + rc = read(fd, buf + total_read, buf_size - total_read);
> > > > +
> > > > + if (rc < 0) {
> > > > + warn_perror("Couldn't read from %s", path);
> > > > + close(fd);
> > > > + return -1;
> > > > + }
> > > > +
> > > > + if (rc == 0)
> > > > + break;
> > > > +
> > > > + total_read += rc;
> > >
> > > Coverity Scan (I can provide instructions separately if desired)
> > > reports one issue below, but I'll mention it here for clarity: you are
> > > adding 'rc', of type ssize_t, to total_read, of type size_t, and
> > > buf_size is also of type size_t, so you could overflow total_read by
> > > adding for example the maximum value for ssize_t twice, to it.
> > >
> > > We can't run into the (theoretical) issue fixed by d836d9e34586 ("util:
> > > Remove possible quadratic behaviour from write_remainder()") but the
> > > solution here might be similar.
> > >
> > > In general we should make sure that rc is less than whatever value we
> > > might sum to total_read to make it overflow at any point in time.
> > >
> > > I didn't really check this in detail, I can do that if needed, and
> > > perhaps David remembers more clearly what we did in a similar
> > > situation. It might also be a false positive, by the way.
> >
> > I think there are two slightly overlapping issues here.
> >
> > 1) I'm not sure Coverity knows/trusts that read() will never return
> > more than its third argument. That's what stops total_read from
> > ever exceeding buf_size. I'd need to think a bit harder about how
> > to convince it that's the case.
> >
> > 2) buf_size is size_t, but we're returning ssize_t. If we passed a
> > buf_size greater than ssize_t can hold, it would make a mess (UB, I
> > think). I don't think there are any perfectly elegant solutions in
> > C, so I'd suggest:
> > ASSERT(buf_size <= SSIZE_MAX);
> >
> > at the top of the function.
> >
> > I'd try (2) first because it's a real (if unlikely to be triggered)
> > bug. Then we can see if Coverity still complains (Yumei, I can walk
> > you through how to install and run Coverity locally using Red Hat's
> > subscription).
>
> Coverity doesn't complain about it in my setup. Stefano may give more
> info on that. In a word, it's a false positive.
Right, my bad, it turns out I was using an outdated version and this
false positive doesn't appear anymore.
> > [snip]
> > > > + }
> > > > +
> > > > + close(fd);
> > > > +
> > > > + if (total_read == buf_size) {
> > > > + warn("File %s truncated, buffer too small", path);
> > >
> > > The file wasn't truncated (on disk) as this comment might seem to
> > > indicate. I'd rather say "File contents exceed buffer size", or
> > > "Partial file read", something like that.
> > >
> > > While at it, you could print the size we read (it's %zu, see similar
> > > examples where we print size_t types).
> > >
> > > > + return -2;
> > >
> > > Safer to NULL-terminate also in this case, perhaps? A future caller
> > > might handle -2 (or equivalent) as a "partial" failure and use the
> > > buffer anyway, so not NULL-terminating it is rather subtle.
> >
> > That's a good idea. Given the purpose of the function, I think a
> > caller _should_ ignore the buffer if it gets an error, but it's
> > worthwhile to limit the damage if a caller forgets to check. That
> > applies for other error cases too.
>
> The rest will update in v7 as well.
Thanks.
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-28 7:43 ` Yumei Huang
@ 2025-10-28 11:44 ` Stefano Brivio
2025-10-29 2:31 ` Yumei Huang
0 siblings, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-28 11:44 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Tue, 28 Oct 2025 15:43:25 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Fri, 17 Oct 2025 14:28:37 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > If a client connects while guest is not connected or ready yet,
> > > resend SYN instead of just resetting connection after 10 seconds.
> > >
> > > Use the same backoff calculation for the timeout as linux kernel.
> >
> > Linux.
> >
> > >
> > > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > ---
> > > tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
> > > tcp.h | 5 +++++
> > > 2 files changed, 52 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 2ec4b0c..9385132 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -179,9 +179,11 @@
> > > *
> > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > *
> > > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > > - * connection
> > > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > + * SYN. It's the starting timeout for the first SYN retry. If this persists
> >
> > "If this persists" makes sense for the existing ACK_TIMEOUT
> > description but not here, because it looks like it refers to "starting
> > timeout".
> >
> > Coupled with the next patch, it becomes increasingly difficult to
> > understand what "this" persisting thing is.
> >
> > Maybe directly say "Retry for ..., then reset the connection"? It's
> > shorter and clearer.
> >
> > > + * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > + * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > *
> > > * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > @@ -340,7 +342,7 @@ enum {
> > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > >
> > > #define ACK_INTERVAL 10 /* ms */
> > > -#define SYN_TIMEOUT 10 /* s */
> > > +#define SYN_TIMEOUT_INIT 1 /* s */
> >
> > Maybe mention RFC 6928 as done above? That's where this value comes
> > from.
> >
> > I just noticed you do that in 4/4, so it's slightly nicer if you do
> > that right away here for ease of future reference, but not really
> > needed.
> >
> > > #define ACK_TIMEOUT 2
> > > #define FIN_TIMEOUT 60
> > > #define ACT_TIMEOUT 7200
> > > @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> > >
> > > #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
> > >
> > > +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > > +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
> >
> > This uses 121 columns. I'm not sure where all those tabs and \ come
> > from.
>
> My bad. I renamed the macro and it fits in 80 columns but I forgot to
> remove the '\' and the spaces before that.
> >
> > > +
> > > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > >
> > > @@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > - if (!(conn->events & ESTABLISHED))
> > > - it.it_value.tv_sec = SYN_TIMEOUT;
> > > + if (!(conn->events & ESTABLISHED)) {
> > > + if (conn->retries < c->tcp.syn_linear_timeouts)
> > > + it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > + else
> > > + it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > + (conn->retries - c->tcp.syn_linear_timeouts);
> >
> > Probably more readable, but I haven't tried: always start from
> > SYN_TIMEOUT_INIT, then multiply/shift if
> > conn->retries >= c->tcp.syn_linear_timeouts.
>
> I guess you meant:
>
> if (!(conn->events & ESTABLISHED)) {
> it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> if (conn->retries >= c->tcp.syn_linear_timeouts)
> it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> (conn->retries - c->tcp.syn_linear_timeouts);
Well, yes, this, but without repeating the assignment, so
<<= (conn->retries - c->tcp.syn_linear_timeouts).
> other than something like:
>
> it.it_value.tv_sec <<= 1
Hmm, no, why 1?
> in above second if block, since the latter would cause
> it.it_value.tv_sec=2 always when retries>=syn_linear_timeouts, as
> it.it_value.tv_sec is set to SYN_TIMEOUT_INIT before the if condition.
>
> If I'm correct, I'm not sure if it's more readable to change to that.
> I feel the if/else clause quite matches the way the kernel handles it.
> What do you think?
Sorry, I took this for granted, but I rather meant:
it.it_value.tv_sec = SYN_TIMEOUT_INIT;
if (conn->retries >= c->tcp.syn_linear_timeouts) {
it.it_value.tv_sec <<=
(conn->retries -
c->tcp.syn_linear_timeouts);
}
because in both cases we're starting from SYN_TIMEOUT_INIT.
And now that I wrote that, I'll pretend I actually meant this:
if (!(conn->events & ESTABLISHED)) {
int factor = conn->retries - c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(factor, 0);
} else {
...
}
...where we make the conditional implicit (but I think intuitive) by
adding 0 as a lower bound. Am I missing something...?
> > > + }
> > > else
> > > it.it_value.tv_sec = ACK_TIMEOUT;
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > > tcp_timer_ctl(c, conn);
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > if (!(conn->events & ESTABLISHED)) {
> > > - flow_dbg(conn, "handshake timeout");
> > > - tcp_rst(c, conn);
> > > + if (conn->retries >= TCP_MAX_RETRIES ||
> > > + conn->retries >= (c->tcp.tcp_syn_retries +
> > > + c->tcp.syn_linear_timeouts)) {
> > > + flow_dbg(conn, "handshake timeout");
> > > + tcp_rst(c, conn);
> > > + } else {
> > > + flow_trace(conn, "SYN timeout, retry");
> > > + tcp_send_flag(c, conn, SYN);
> > > + conn->retries++;
> >
> > I think I already raised this point on a previous revision: this needs
> > to be zeroed as the connection is established, but I don't see that in
> > the current version.
> >
> > > + tcp_timer_ctl(c, conn);
> > > + }
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > flow_dbg(conn, "FIN timeout");
> > > tcp_rst(c, conn);
> > > @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > > return sl;
> > > }
> > >
> > > +/**
> > > + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
> >
> > They're not initial, they'll be used for all the connections if I
> > understand correctly.
> >
> > Maybe "Get SYN retries sysctl values"?
> > I think the _init() in the
> > function name is also somewhat misleading.
>
> Do you have a better idea for the function name? I named it just like
> other functions called in tcp_init(), such as tcp_sock_iov_init(),
> tcp_splice_init().
Given that David was suggesting "Get host kernel RTO parameters" as a
comment to this function, maybe tcp_get_rto_params()?
The only similar function we have is tcp_get_sndbuf(), so I would make
it consistent with that. The most peculiar task of this function is
that it gets parameters ("get"), I think, rather than assigning them to
variables ("init").
> > > + * @c: Execution context
> > > +*/
> > > +void tcp_syn_params_init(struct ctx *c)
> > > +{
> > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > +
> > > + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
> >
> > Why 8? Perhaps a #define would help?
>
> The 8 actually comes from your suggestion in v1:
> "
> (optional, and might be in another series, but keeping it together
> with the rest might be more convenient): read tcp_syn_retries, limit
> to 8, and also read tcp_syn_linear_timeouts
> "
That was a *limit* of 8, coming from the fact that we have 3 bits to
store the retry count. This is a default instead.
> I guess it can be replaced with 6 as it's the default value according to
> Documentation/networking/ip-sysctl.rst.
I'd rather check the code because sometimes people forget to update
that documentation but yes, it matches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/net/tcp.h?id=a7556779745c047efb7b0ce8732889b0cdc80936#n125
> We already have TCP_SYN_RETRIES for the sysctl file, maybe have
> another TCP_SYN_RETRIES_DEFAULT for the number?
It sounds reasonable to me.
> > > + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
>
> The default value is 4 as Documentation/networking/ip-sysctl.rst. I
> don't quite remember where the 1 is from. Should we update it as well
> and add the similar macro as above?
I would, yes. By the way, it's hardcoded here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_ipv4.c?id=e97269257fe437910cddc7c642a636ca3cf9fb1d#n3669
but defined here for thin streams (connections where observed
throughput is low: https://docs.kernel.org/networking/tcp-thin.html):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/net/tcp.h?id=a7556779745c047efb7b0ce8732889b0cdc80936#n266
I'd suggest to define a constant with value '4' and use it.
> > > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > > +
> > > + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
> >
> > Similar to the comment above: these are not parameters of SYN segments
> > (which would seem to imply TCP options, such as the MSS).
> >
> > We typically don't print C assignments, rather human-readable messages,
> > so that could be "Read sysctl values tcp_syn_retries: ...,
> > syn_linear_timeouts: ...".
>
> Got it.
> >
> >
> > > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > > +}
> > > +
> > > /**
> > > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > > * @c: Execution context
> > > @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
> > > {
> > > ASSERT(!c->no_tcp);
> > >
> > > + tcp_syn_params_init(c);
> > > +
> > > tcp_sock_iov_init(c);
> > >
> > > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > > diff --git a/tcp.h b/tcp.h
> > > index 234a803..4369b52 100644
> > > --- a/tcp.h
> > > +++ b/tcp.h
> > > @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
> > > * @fwd_out: Port forwarding configuration for outbound packets
> > > * @timer_run: Timestamp of most recent timer run
> > > * @pipe_size: Size of pipes for spliced connections
> > > + * @tcp_syn_retries: Number of SYN retries during handshake
> > > + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
> > > + * before switching to exponential backoff timeout
> >
> > Maybe more compact:
> >
> > * @syn_linear_timeouts: SYN retries before using exponential timeout
>
> So we remove the '"Number of" as well? I guess we should make it
> consistent for both tcp_syn_retries and syn_linear_timeouts.
Ah, yes, we should.
I mean, I guess it's clear that we're talking about the count / number
of those and not their colour or taste...
> >
> > > */
> > > struct tcp_ctx {
> > > struct fwd_ports fwd_in;
> > > struct fwd_ports fwd_out;
> > > struct timespec timer_run;
> > > size_t pipe_size;
> > > + uint8_t tcp_syn_retries;
> > > + uint8_t syn_linear_timeouts;
> > > };
> > >
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-28 8:09 ` Yumei Huang
@ 2025-10-28 11:44 ` Stefano Brivio
2025-10-28 11:54 ` Stefano Brivio
2025-10-29 3:06 ` Yumei Huang
0 siblings, 2 replies; 44+ messages in thread
From: Stefano Brivio @ 2025-10-28 11:44 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Tue, 28 Oct 2025 16:09:03 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Fri, 17 Oct 2025 14:28:38 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > Use an exponential backoff timeout for data retransmission according
> > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > in Appendix A of RFC 6298.
> > >
> > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > >
> > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > > tcp.c | 27 ++++++++++++---------------
> > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 9385132..dc0ec6c 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -179,16 +179,14 @@
> > > *
> > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > *
> > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > - *
> > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > + * timeout for the first retry, in seconds. If this persists too many times
> > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > + * handshake.
> > > *
> > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > @@ -342,8 +340,7 @@ enum {
> > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > >
> > > #define ACK_INTERVAL 10 /* ms */
> > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > -#define ACK_TIMEOUT 2
> > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > #define FIN_TIMEOUT 60
> > > #define ACT_TIMEOUT 7200
> > >
> > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > if (!(conn->events & ESTABLISHED)) {
> > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > + it.it_value.tv_sec = RTO_INIT;
> > > else
> > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > + it.it_value.tv_sec = RTO_INIT <<
> > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > }
> > > else
> > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> >
> > Same as on 3/4, but here it's clearly more convenient: just assign
> > RTO_INIT, and multiply as needed in the if / else clauses.
>
> I guess we can't just assign RTO_INIT. Maybe assign it only when
> retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
Why can't you do that? Say:
it.it_value.tv_sec = RTO_INIT
if (!(conn->events & ESTABLISHED)) {
if (conn->retries >= c->tcp.syn_linear_timeouts)
it.it_value.tv_sec <<= (conn->retries -
c->tcp.syn_linear_timeouts);
but anyway, see below.
> But it seems more complicated. What do you think?
Or maybe, building on my latest comment to 3/4:
int factor = conn->retries;
if (!(conn->events & ESTABLISHED))
factor -= c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
?
> >
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > } else {
> >
> > The rest of the series looks good to me.
> >
> > It might be slightly more practical to factor in directly the RTO
> > clamp, and I don't think it's complicated now that you have the helper
> > from 2/4, but it's not a strong preference from my side, as the series
> > makes sense in any case.
>
> Reading tcp_rto_max_ms can be easy with the helper. My concern is
> about the way we get the total time for retries.
>
> I used to do it like this in v2,
> https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
>
> +#define RETRY_ELAPSED(timeout_init, retries) \
> + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
>
> Though the formula is not quite right, we could refine it as below:
>
> #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
>
> Does it make sense to get the time this way?
Well, it also depends on c->tcp.syn_linear_timeouts, right?
But in any case, do you really need to calculate that explicitly? I was
thinking that you can just clamp the value when you use it in
tcp_timer_ctl().
If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
you read from the kernel, then I guess it's just:
---
it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
} ...
if (timerfd_settime(conn->timer, 0, &it, NULL))
flow_perror(conn, "failed to set timer");
---
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-28 11:44 ` Stefano Brivio
@ 2025-10-28 11:54 ` Stefano Brivio
2025-10-29 3:06 ` Yumei Huang
1 sibling, 0 replies; 44+ messages in thread
From: Stefano Brivio @ 2025-10-28 11:54 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Tue, 28 Oct 2025 12:44:26 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:
> Or maybe, building on my latest comment to 3/4:
>
> int factor = conn->retries;
>
> if (!(conn->events & ESTABLISHED))
> factor -= c->tcp.syn_linear_timeouts;
>
> it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
It just occurred to me that 'factor' might be misleading as this is
rather an exponent, so maybe 'exp' instead.
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 3/4] tcp: Resend SYN for inbound connections
2025-10-28 11:44 ` Stefano Brivio
@ 2025-10-29 2:31 ` Yumei Huang
0 siblings, 0 replies; 44+ messages in thread
From: Yumei Huang @ 2025-10-29 2:31 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Tue, 28 Oct 2025 15:43:25 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Fri, 17 Oct 2025 14:28:37 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > If a client connects while guest is not connected or ready yet,
> > > > resend SYN instead of just resetting connection after 10 seconds.
> > > >
> > > > Use the same backoff calculation for the timeout as linux kernel.
> > >
> > > Linux.
> > >
> > > >
> > > > Link: https://bugs.passt.top/show_bug.cgi?id=153
> > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > ---
> > > > tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--------
> > > > tcp.h | 5 +++++
> > > > 2 files changed, 52 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 2ec4b0c..9385132 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -179,9 +179,11 @@
> > > > *
> > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > *
> > > > - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag
> > > > - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the
> > > > - * connection
> > > > + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > + * SYN. It's the starting timeout for the first SYN retry. If this persists
> > >
> > > "If this persists" makes sense for the existing ACK_TIMEOUT
> > > description but not here, because it looks like it refers to "starting
> > > timeout".
> > >
> > > Coupled with the next patch, it becomes increasingly difficult to
> > > understand what "this" persisting thing is.
> > >
> > > Maybe directly say "Retry for ..., then reset the connection"? It's
> > > shorter and clearer.
> > >
> > > > + * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > + * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > *
> > > > * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > @@ -340,7 +342,7 @@ enum {
> > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > >
> > > > #define ACK_INTERVAL 10 /* ms */
> > > > -#define SYN_TIMEOUT 10 /* s */
> > > > +#define SYN_TIMEOUT_INIT 1 /* s */
> > >
> > > Maybe mention RFC 6928 as done above? That's where this value comes
> > > from.
> > >
> > > I just noticed you do that in 4/4, so it's slightly nicer if you do
> > > that right away here for ease of future reference, but not really
> > > needed.
> > >
> > > > #define ACK_TIMEOUT 2
> > > > #define FIN_TIMEOUT 60
> > > > #define ACT_TIMEOUT 7200
> > > > @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
> > > >
> > > > #define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
> > > >
> > > > +#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries"
> > > > +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
> > >
> > > This uses 121 columns. I'm not sure where all those tabs and \ come
> > > from.
> >
> > My bad. I renamed the macro and it fits in 80 columns but I forgot to
> > remove the '\' and the spaces before that.
> > >
> > > > +
> > > > /* "Extended" data (not stored in the flow table) for TCP flow migration */
> > > > static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > > >
> > > > @@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > if (conn->flags & ACK_TO_TAP_DUE) {
> > > > it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > - if (!(conn->events & ESTABLISHED))
> > > > - it.it_value.tv_sec = SYN_TIMEOUT;
> > > > + if (!(conn->events & ESTABLISHED)) {
> > > > + if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > + it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > + else
> > > > + it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > + (conn->retries - c->tcp.syn_linear_timeouts);
> > >
> > > Probably more readable, but I haven't tried: always start from
> > > SYN_TIMEOUT_INIT, then multiply/shift if
> > > conn->retries >= c->tcp.syn_linear_timeouts.
> >
> > I guess you meant:
> >
> > if (!(conn->events & ESTABLISHED)) {
> > it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > (conn->retries - c->tcp.syn_linear_timeouts);
>
> Well, yes, this, but without repeating the assignment, so
> <<= (conn->retries - c->tcp.syn_linear_timeouts).
>
> > other than something like:
> >
> > it.it_value.tv_sec <<= 1
>
> Hmm, no, why 1?
I thought you meant: when retries >= syn_linear_timeouts, we don't use
SYN_TIMEOUT_INIT anymore, but to multiply it.it_value.tv_sec itself,
like if it's 2 this time, next time it's 4 (2<<=1).
Now I get your point, and it makes sense to me.
>
> > in above second if block, since the latter would cause
> > it.it_value.tv_sec=2 always when retries>=syn_linear_timeouts, as
> > it.it_value.tv_sec is set to SYN_TIMEOUT_INIT before the if condition.
> >
> > If I'm correct, I'm not sure if it's more readable to change to that.
> > I feel the if/else clause quite matches the way the kernel handles it.
> > What do you think?
>
> Sorry, I took this for granted, but I rather meant:
>
> it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> if (conn->retries >= c->tcp.syn_linear_timeouts) {
> it.it_value.tv_sec <<=
> (conn->retries -
> c->tcp.syn_linear_timeouts);
> }
>
> because in both cases we're starting from SYN_TIMEOUT_INIT.
>
> And now that I wrote that, I'll pretend I actually meant this:
>
> if (!(conn->events & ESTABLISHED)) {
> int factor = conn->retries - c->tcp.syn_linear_timeouts;
>
> it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(factor, 0);
> } else {
> ...
> }
>
> ...where we make the conditional implicit (but I think intuitive) by
> adding 0 as a lower bound. Am I missing something...?
No, it was me misunderstanding your point. I will update accordingly.
>
> > > > + }
> > > > else
> > > > it.it_value.tv_sec = ACK_TIMEOUT;
> > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref)
> > > > tcp_timer_ctl(c, conn);
> > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > if (!(conn->events & ESTABLISHED)) {
> > > > - flow_dbg(conn, "handshake timeout");
> > > > - tcp_rst(c, conn);
> > > > + if (conn->retries >= TCP_MAX_RETRIES ||
> > > > + conn->retries >= (c->tcp.tcp_syn_retries +
> > > > + c->tcp.syn_linear_timeouts)) {
> > > > + flow_dbg(conn, "handshake timeout");
> > > > + tcp_rst(c, conn);
> > > > + } else {
> > > > + flow_trace(conn, "SYN timeout, retry");
> > > > + tcp_send_flag(c, conn, SYN);
> > > > + conn->retries++;
> > >
> > > I think I already raised this point on a previous revision: this needs
> > > to be zeroed as the connection is established, but I don't see that in
> > > the current version.
> > >
> > > > + tcp_timer_ctl(c, conn);
> > > > + }
> > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > flow_dbg(conn, "FIN timeout");
> > > > tcp_rst(c, conn);
> > > > @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void)
> > > > return sl;
> > > > }
> > > >
> > > > +/**
> > > > + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
> > >
> > > They're not initial, they'll be used for all the connections if I
> > > understand correctly.
> > >
> > > Maybe "Get SYN retries sysctl values"?
> > > I think the _init() in the
> > > function name is also somewhat misleading.
> >
> > Do you have a better idea for the function name? I named it just like
> > other functions called in tcp_init(), such as tcp_sock_iov_init(),
> > tcp_splice_init().
>
> Given that David was suggesting "Get host kernel RTO parameters" as a
> comment to this function, maybe tcp_get_rto_params()?
>
> The only similar function we have is tcp_get_sndbuf(), so I would make
> it consistent with that. The most peculiar task of this function is
> that it gets parameters ("get"), I think, rather than assigning them to
> variables ("init").
Make sense.
>
> > > > + * @c: Execution context
> > > > +*/
> > > > +void tcp_syn_params_init(struct ctx *c)
> > > > +{
> > > > + intmax_t tcp_syn_retries, syn_linear_timeouts;
> > > > +
> > > > + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
> > >
> > > Why 8? Perhaps a #define would help?
> >
> > The 8 actually comes from your suggestion in v1:
> > "
> > (optional, and might be in another series, but keeping it together
> > with the rest might be more convenient): read tcp_syn_retries, limit
> > to 8, and also read tcp_syn_linear_timeouts
> > "
>
> That was a *limit* of 8, coming from the fact that we have 3 bits to
> store the retry count. This is a default instead.
>
> > I guess it can be replaced with 6 as it's the default value according to
> > Documentation/networking/ip-sysctl.rst.
>
> I'd rather check the code because sometimes people forget to update
> that documentation but yes, it matches:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/net/tcp.h?id=a7556779745c047efb7b0ce8732889b0cdc80936#n125
>
> > We already have TCP_SYN_RETRIES for the sysctl file, maybe have
> > another TCP_SYN_RETRIES_DEFAULT for the number?
>
> It sounds reasonable to me.
>
> > > > + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
> >
> > The default value is 4 as Documentation/networking/ip-sysctl.rst. I
> > don't quite remember where the 1 is from. Should we update it as well
> > and add the similar macro as above?
>
> I would, yes. By the way, it's hardcoded here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_ipv4.c?id=e97269257fe437910cddc7c642a636ca3cf9fb1d#n3669
>
> but defined here for thin streams (connections where observed
> throughput is low: https://docs.kernel.org/networking/tcp-thin.html):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/net/tcp.h?id=a7556779745c047efb7b0ce8732889b0cdc80936#n266
>
> I'd suggest to define a constant with value '4' and use it.
Got it.
>
> > > > + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX);
> > > > + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX);
> > > > +
> > > > + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
> > >
> > > Similar to the comment above: these are not parameters of SYN segments
> > > (which would seem to imply TCP options, such as the MSS).
> > >
> > > We typically don't print C assignments, rather human-readable messages,
> > > so that could be "Read sysctl values tcp_syn_retries: ...,
> > > syn_linear_timeouts: ...".
> >
> > Got it.
> > >
> > >
> > > > + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts);
> > > > +}
> > > > +
> > > > /**
> > > > * tcp_init() - Get initial sequence, hash secret, initialise per-socket data
> > > > * @c: Execution context
> > > > @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c)
> > > > {
> > > > ASSERT(!c->no_tcp);
> > > >
> > > > + tcp_syn_params_init(c);
> > > > +
> > > > tcp_sock_iov_init(c);
> > > >
> > > > memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4));
> > > > diff --git a/tcp.h b/tcp.h
> > > > index 234a803..4369b52 100644
> > > > --- a/tcp.h
> > > > +++ b/tcp.h
> > > > @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref {
> > > > * @fwd_out: Port forwarding configuration for outbound packets
> > > > * @timer_run: Timestamp of most recent timer run
> > > > * @pipe_size: Size of pipes for spliced connections
> > > > + * @tcp_syn_retries: Number of SYN retries during handshake
> > > > + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout
> > > > + * before switching to exponential backoff timeout
> > >
> > > Maybe more compact:
> > >
> > > * @syn_linear_timeouts: SYN retries before using exponential timeout
> >
> > So we remove the '"Number of" as well? I guess we should make it
> > consistent for both tcp_syn_retries and syn_linear_timeouts.
>
> Ah, yes, we should.
>
> I mean, I guess it's clear that we're talking about the count / number
> of those and not their colour or taste...
True.
>
> > >
> > > > */
> > > > struct tcp_ctx {
> > > > struct fwd_ports fwd_in;
> > > > struct fwd_ports fwd_out;
> > > > struct timespec timer_run;
> > > > size_t pipe_size;
> > > > + uint8_t tcp_syn_retries;
> > > > + uint8_t syn_linear_timeouts;
> > > > };
> > > >
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-28 11:44 ` Stefano Brivio
2025-10-28 11:54 ` Stefano Brivio
@ 2025-10-29 3:06 ` Yumei Huang
2025-10-29 4:38 ` Stefano Brivio
1 sibling, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-29 3:06 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Tue, 28 Oct 2025 16:09:03 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > Use an exponential backoff timeout for data retransmission according
> > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > in Appendix A of RFC 6298.
> > > >
> > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > >
> > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > > tcp.c | 27 ++++++++++++---------------
> > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 9385132..dc0ec6c 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -179,16 +179,14 @@
> > > > *
> > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > *
> > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > - *
> > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > + * handshake.
> > > > *
> > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > @@ -342,8 +340,7 @@ enum {
> > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > >
> > > > #define ACK_INTERVAL 10 /* ms */
> > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > -#define ACK_TIMEOUT 2
> > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > #define FIN_TIMEOUT 60
> > > > #define ACT_TIMEOUT 7200
> > > >
> > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > if (!(conn->events & ESTABLISHED)) {
> > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > + it.it_value.tv_sec = RTO_INIT;
> > > > else
> > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > }
> > > > else
> > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > >
> > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > RTO_INIT, and multiply as needed in the if / else clauses.
> >
> > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
>
> Why can't you do that? Say:
>
> it.it_value.tv_sec = RTO_INIT
> if (!(conn->events & ESTABLISHED)) {
> if (conn->retries >= c->tcp.syn_linear_timeouts)
> it.it_value.tv_sec <<= (conn->retries -
> c->tcp.syn_linear_timeouts);
>
> but anyway, see below.
>
> > But it seems more complicated. What do you think?
>
> Or maybe, building on my latest comment to 3/4:
>
> int factor = conn->retries;
>
> if (!(conn->events & ESTABLISHED))
> factor -= c->tcp.syn_linear_timeouts;
>
> it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
>
> ?
Yeah, I understand this part now.
>
> > >
> > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > } else {
> > >
> > > The rest of the series looks good to me.
> > >
> > > It might be slightly more practical to factor in directly the RTO
> > > clamp, and I don't think it's complicated now that you have the helper
> > > from 2/4, but it's not a strong preference from my side, as the series
> > > makes sense in any case.
> >
> > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > about the way we get the total time for retries.
> >
> > I used to do it like this in v2,
> > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> >
> > +#define RETRY_ELAPSED(timeout_init, retries) \
> > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> >
> > Though the formula is not quite right, we could refine it as below:
> >
> > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> >
> > Does it make sense to get the time this way?
>
> Well, it also depends on c->tcp.syn_linear_timeouts, right?
Not really, it's only used for data retransmission, so
syn_linear_timeouts is not relevant. Probably I should name it more
clearly.
>
> But in any case, do you really need to calculate that explicitly? I was
> thinking that you can just clamp the value when you use it in
> tcp_timer_ctl().
>
> If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> you read from the kernel, then I guess it's just:
>
> ---
> it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
After reading the comments in v3 when tcp_rto_max_ms was first
mentioned again, I realized I got something wrong again. I thought it
was for the total timeout for all retries, so I need to calculate that
and decide to reset the connection as in v2.
Anyway, you are right. We don't need to do that. Thanks for your patience.
> } ...
>
> if (timerfd_settime(conn->timer, 0, &it, NULL))
> flow_perror(conn, "failed to set timer");
> ---
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-29 3:06 ` Yumei Huang
@ 2025-10-29 4:38 ` Stefano Brivio
2025-10-29 5:11 ` Yumei Huang
0 siblings, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-29 4:38 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Wed, 29 Oct 2025 11:06:44 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Tue, 28 Oct 2025 16:09:03 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > >
> > > > > Use an exponential backoff timeout for data retransmission according
> > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > in Appendix A of RFC 6298.
> > > > >
> > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > >
> > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > ---
> > > > > tcp.c | 27 ++++++++++++---------------
> > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/tcp.c b/tcp.c
> > > > > index 9385132..dc0ec6c 100644
> > > > > --- a/tcp.c
> > > > > +++ b/tcp.c
> > > > > @@ -179,16 +179,14 @@
> > > > > *
> > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > *
> > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > - *
> > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > + * handshake.
> > > > > *
> > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > @@ -342,8 +340,7 @@ enum {
> > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > >
> > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > -#define ACK_TIMEOUT 2
> > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > #define FIN_TIMEOUT 60
> > > > > #define ACT_TIMEOUT 7200
> > > > >
> > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > else
> > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > }
> > > > > else
> > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > >
> > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > >
> > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> >
> > Why can't you do that? Say:
> >
> > it.it_value.tv_sec = RTO_INIT
> > if (!(conn->events & ESTABLISHED)) {
> > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > it.it_value.tv_sec <<= (conn->retries -
> > c->tcp.syn_linear_timeouts);
> >
> > but anyway, see below.
> >
> > > But it seems more complicated. What do you think?
> >
> > Or maybe, building on my latest comment to 3/4:
> >
> > int factor = conn->retries;
> >
> > if (!(conn->events & ESTABLISHED))
> > factor -= c->tcp.syn_linear_timeouts;
> >
> > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> >
> > ?
>
> Yeah, I understand this part now.
>
> >
> > > >
> > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > } else {
> > > >
> > > > The rest of the series looks good to me.
> > > >
> > > > It might be slightly more practical to factor in directly the RTO
> > > > clamp, and I don't think it's complicated now that you have the helper
> > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > makes sense in any case.
> > >
> > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > about the way we get the total time for retries.
> > >
> > > I used to do it like this in v2,
> > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > >
> > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > >
> > > Though the formula is not quite right, we could refine it as below:
> > >
> > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > >
> > > Does it make sense to get the time this way?
> >
> > Well, it also depends on c->tcp.syn_linear_timeouts, right?
>
> Not really, it's only used for data retransmission, so
> syn_linear_timeouts is not relevant.
Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
stating:
(2.5) A maximum value MAY be placed on RTO provided it is at least 60
seconds.
...the only thing that I don't see implemented in this version of the
patch is paragraph 5.7:
(5.7) If the timer expires awaiting the ACK of a SYN segment and the
TCP implementation is using an RTO less than 3 seconds, the RTO
MUST be re-initialized to 3 seconds when data transmission
begins (i.e., after the three-way handshake completes).
I missed that while reviewing earlier versions. I guess we need to use
a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
simpler than re-introducing separate starting values (one second and
three seconds).
> Probably I should name it more clearly.
>
> >
> > But in any case, do you really need to calculate that explicitly? I was
> > thinking that you can just clamp the value when you use it in
> > tcp_timer_ctl().
> >
> > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > you read from the kernel, then I guess it's just:
> >
> > ---
> > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
>
> After reading the comments in v3 when tcp_rto_max_ms was first
> mentioned again, I realized I got something wrong again. I thought it
> was for the total timeout for all retries, so I need to calculate that
> and decide to reset the connection as in v2.
I think it actually applies to all the retries.
> Anyway, you are right. We don't need to do that. Thanks for your patience.
>
> > } ...
> >
> > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > flow_perror(conn, "failed to set timer");
> > ---
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-29 4:38 ` Stefano Brivio
@ 2025-10-29 5:11 ` Yumei Huang
2025-10-29 7:09 ` Stefano Brivio
0 siblings, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-29 5:11 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Wed, 29 Oct 2025 11:06:44 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Tue, 28 Oct 2025 16:09:03 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > >
> > > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > >
> > > > > > Use an exponential backoff timeout for data retransmission according
> > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > > in Appendix A of RFC 6298.
> > > > > >
> > > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > > >
> > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > ---
> > > > > > tcp.c | 27 ++++++++++++---------------
> > > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > > >
> > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > index 9385132..dc0ec6c 100644
> > > > > > --- a/tcp.c
> > > > > > +++ b/tcp.c
> > > > > > @@ -179,16 +179,14 @@
> > > > > > *
> > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > > *
> > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > - *
> > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > > + * handshake.
> > > > > > *
> > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > @@ -342,8 +340,7 @@ enum {
> > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > > >
> > > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > > -#define ACK_TIMEOUT 2
> > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > > #define FIN_TIMEOUT 60
> > > > > > #define ACT_TIMEOUT 7200
> > > > > >
> > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > > else
> > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > > }
> > > > > > else
> > > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > > >
> > > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > > >
> > > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> > >
> > > Why can't you do that? Say:
> > >
> > > it.it_value.tv_sec = RTO_INIT
> > > if (!(conn->events & ESTABLISHED)) {
> > > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > > it.it_value.tv_sec <<= (conn->retries -
> > > c->tcp.syn_linear_timeouts);
> > >
> > > but anyway, see below.
> > >
> > > > But it seems more complicated. What do you think?
> > >
> > > Or maybe, building on my latest comment to 3/4:
> > >
> > > int factor = conn->retries;
> > >
> > > if (!(conn->events & ESTABLISHED))
> > > factor -= c->tcp.syn_linear_timeouts;
> > >
> > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> > >
> > > ?
> >
> > Yeah, I understand this part now.
> >
> > >
> > > > >
> > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > > } else {
> > > > >
> > > > > The rest of the series looks good to me.
> > > > >
> > > > > It might be slightly more practical to factor in directly the RTO
> > > > > clamp, and I don't think it's complicated now that you have the helper
> > > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > > makes sense in any case.
> > > >
> > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > > about the way we get the total time for retries.
> > > >
> > > > I used to do it like this in v2,
> > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > > >
> > > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > > >
> > > > Though the formula is not quite right, we could refine it as below:
> > > >
> > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > > >
> > > > Does it make sense to get the time this way?
> > >
> > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
> >
> > Not really, it's only used for data retransmission, so
> > syn_linear_timeouts is not relevant.
>
> Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
> stating:
I meant RETRY_ELAPSED was only used for data retransmission, which
uses exponential backoff timeout directly, so syn_linear_timeouts was
not relevant.
>
> (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> seconds.
For SYN retries, as we used linear backoff + exponential backoff, and
also limited by TCP_MAX_RETRIES, the possible max RTO is far less than
60s. So we didn't clamp it. Do you think we need to clamp it as well?
>
> ...the only thing that I don't see implemented in this version of the
> patch is paragraph 5.7:
>
> (5.7) If the timer expires awaiting the ACK of a SYN segment and the
> TCP implementation is using an RTO less than 3 seconds, the RTO
> MUST be re-initialized to 3 seconds when data transmission
> begins (i.e., after the three-way handshake completes).
>
> I missed that while reviewing earlier versions. I guess we need to use
> a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
> simpler than re-introducing separate starting values (one second and
> three seconds).
I'm not sure I understand here. If the timer expires, didn't we reset
the connection directly? We would never get to the data transmission
phase?
>
> > Probably I should name it more clearly.
> >
> > >
> > > But in any case, do you really need to calculate that explicitly? I was
> > > thinking that you can just clamp the value when you use it in
> > > tcp_timer_ctl().
> > >
> > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > > you read from the kernel, then I guess it's just:
> > >
> > > ---
> > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
> >
> > After reading the comments in v3 when tcp_rto_max_ms was first
> > mentioned again, I realized I got something wrong again. I thought it
> > was for the total timeout for all retries, so I need to calculate that
> > and decide to reset the connection as in v2.
>
> I think it actually applies to all the retries.
>
> > Anyway, you are right. We don't need to do that. Thanks for your patience.
> >
> > > } ...
> > >
> > > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > > flow_perror(conn, "failed to set timer");
> > > ---
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-29 5:11 ` Yumei Huang
@ 2025-10-29 7:09 ` Stefano Brivio
2025-10-29 7:32 ` Yumei Huang
0 siblings, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-29 7:09 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Wed, 29 Oct 2025 13:11:48 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Wed, 29 Oct 2025 11:06:44 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > On Tue, 28 Oct 2025 16:09:03 +0800
> > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > >
> > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > >
> > > > > > > Use an exponential backoff timeout for data retransmission according
> > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > > > in Appendix A of RFC 6298.
> > > > > > >
> > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > > > >
> > > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > ---
> > > > > > > tcp.c | 27 ++++++++++++---------------
> > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > > > >
> > > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > > index 9385132..dc0ec6c 100644
> > > > > > > --- a/tcp.c
> > > > > > > +++ b/tcp.c
> > > > > > > @@ -179,16 +179,14 @@
> > > > > > > *
> > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > > > *
> > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > > - *
> > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > > > + * handshake.
> > > > > > > *
> > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > > @@ -342,8 +340,7 @@ enum {
> > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > > > >
> > > > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > > > -#define ACK_TIMEOUT 2
> > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > > > #define FIN_TIMEOUT 60
> > > > > > > #define ACT_TIMEOUT 7200
> > > > > > >
> > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > > > else
> > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > > > }
> > > > > > > else
> > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > > > >
> > > > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > > > >
> > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> > > >
> > > > Why can't you do that? Say:
> > > >
> > > > it.it_value.tv_sec = RTO_INIT
> > > > if (!(conn->events & ESTABLISHED)) {
> > > > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > > > it.it_value.tv_sec <<= (conn->retries -
> > > > c->tcp.syn_linear_timeouts);
> > > >
> > > > but anyway, see below.
> > > >
> > > > > But it seems more complicated. What do you think?
> > > >
> > > > Or maybe, building on my latest comment to 3/4:
> > > >
> > > > int factor = conn->retries;
> > > >
> > > > if (!(conn->events & ESTABLISHED))
> > > > factor -= c->tcp.syn_linear_timeouts;
> > > >
> > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> > > >
> > > > ?
> > >
> > > Yeah, I understand this part now.
> > >
> > > >
> > > > > >
> > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > > > } else {
> > > > > >
> > > > > > The rest of the series looks good to me.
> > > > > >
> > > > > > It might be slightly more practical to factor in directly the RTO
> > > > > > clamp, and I don't think it's complicated now that you have the helper
> > > > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > > > makes sense in any case.
> > > > >
> > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > > > about the way we get the total time for retries.
> > > > >
> > > > > I used to do it like this in v2,
> > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > > > >
> > > > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > > > >
> > > > > Though the formula is not quite right, we could refine it as below:
> > > > >
> > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > > > >
> > > > > Does it make sense to get the time this way?
> > > >
> > > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
> > >
> > > Not really, it's only used for data retransmission, so
> > > syn_linear_timeouts is not relevant.
> >
> > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
> > stating:
>
> I meant RETRY_ELAPSED was only used for data retransmission, which
> uses exponential backoff timeout directly, so syn_linear_timeouts was
> not relevant.
Ah, okay.
> >
> > (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> > seconds.
>
> For SYN retries, as we used linear backoff + exponential backoff, and
> also limited by TCP_MAX_RETRIES, the possible max RTO is far less than
> 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll
reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or
just up to ... + 2^6, that is, 119 seconds?
In any case, what is the difference compared to data retransmissions?
Don't we have 3 bits to store the retry count as well, so we're limited
by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same
counter.
> > ...the only thing that I don't see implemented in this version of the
> > patch is paragraph 5.7:
> >
> > (5.7) If the timer expires awaiting the ACK of a SYN segment and the
> > TCP implementation is using an RTO less than 3 seconds, the RTO
> > MUST be re-initialized to 3 seconds when data transmission
> > begins (i.e., after the three-way handshake completes).
> >
> > I missed that while reviewing earlier versions. I guess we need to use
> > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
> > simpler than re-introducing separate starting values (one second and
> > three seconds).
>
> I'm not sure I understand here. If the timer expires, didn't we reset
> the connection directly? We would never get to the data transmission
> phase?
In the RFC, "the timer expires" indicates one iteration of the timeout
algorithm, that is, it's the same as our timer (tcp_timer_handler())
triggering.
The paragraph says "after the three-way handshake completes", so it
looks like the connection wasn't reset.
> > > Probably I should name it more clearly.
> > >
> > > >
> > > > But in any case, do you really need to calculate that explicitly? I was
> > > > thinking that you can just clamp the value when you use it in
> > > > tcp_timer_ctl().
> > > >
> > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > > > you read from the kernel, then I guess it's just:
> > > >
> > > > ---
> > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
> > >
> > > After reading the comments in v3 when tcp_rto_max_ms was first
> > > mentioned again, I realized I got something wrong again. I thought it
> > > was for the total timeout for all retries, so I need to calculate that
> > > and decide to reset the connection as in v2.
> >
> > I think it actually applies to all the retries.
> >
> > > Anyway, you are right. We don't need to do that. Thanks for your patience.
> > >
> > > > } ...
> > > >
> > > > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > > > flow_perror(conn, "failed to set timer");
> > > > ---
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-29 7:09 ` Stefano Brivio
@ 2025-10-29 7:32 ` Yumei Huang
2025-10-29 7:39 ` Stefano Brivio
0 siblings, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-29 7:32 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Wed, 29 Oct 2025 13:11:48 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Wed, 29 Oct 2025 11:06:44 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > >
> > > > > On Tue, 28 Oct 2025 16:09:03 +0800
> > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > >
> > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > >
> > > > > > > > Use an exponential backoff timeout for data retransmission according
> > > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > > > > in Appendix A of RFC 6298.
> > > > > > > >
> > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > > > > >
> > > > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > > ---
> > > > > > > > tcp.c | 27 ++++++++++++---------------
> > > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > > > index 9385132..dc0ec6c 100644
> > > > > > > > --- a/tcp.c
> > > > > > > > +++ b/tcp.c
> > > > > > > > @@ -179,16 +179,14 @@
> > > > > > > > *
> > > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > > > > *
> > > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > > > - *
> > > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > > > > + * handshake.
> > > > > > > > *
> > > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > > > @@ -342,8 +340,7 @@ enum {
> > > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > > > > >
> > > > > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > > > > -#define ACK_TIMEOUT 2
> > > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > > > > #define FIN_TIMEOUT 60
> > > > > > > > #define ACT_TIMEOUT 7200
> > > > > > > >
> > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > > > > else
> > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > > > > }
> > > > > > > > else
> > > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > > > > >
> > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > > > > >
> > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> > > > >
> > > > > Why can't you do that? Say:
> > > > >
> > > > > it.it_value.tv_sec = RTO_INIT
> > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > > > > it.it_value.tv_sec <<= (conn->retries -
> > > > > c->tcp.syn_linear_timeouts);
> > > > >
> > > > > but anyway, see below.
> > > > >
> > > > > > But it seems more complicated. What do you think?
> > > > >
> > > > > Or maybe, building on my latest comment to 3/4:
> > > > >
> > > > > int factor = conn->retries;
> > > > >
> > > > > if (!(conn->events & ESTABLISHED))
> > > > > factor -= c->tcp.syn_linear_timeouts;
> > > > >
> > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> > > > >
> > > > > ?
> > > >
> > > > Yeah, I understand this part now.
> > > >
> > > > >
> > > > > > >
> > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > > > > } else {
> > > > > > >
> > > > > > > The rest of the series looks good to me.
> > > > > > >
> > > > > > > It might be slightly more practical to factor in directly the RTO
> > > > > > > clamp, and I don't think it's complicated now that you have the helper
> > > > > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > > > > makes sense in any case.
> > > > > >
> > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > > > > about the way we get the total time for retries.
> > > > > >
> > > > > > I used to do it like this in v2,
> > > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > > > > >
> > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > > > > >
> > > > > > Though the formula is not quite right, we could refine it as below:
> > > > > >
> > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > > > > >
> > > > > > Does it make sense to get the time this way?
> > > > >
> > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
> > > >
> > > > Not really, it's only used for data retransmission, so
> > > > syn_linear_timeouts is not relevant.
> > >
> > > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
> > > stating:
> >
> > I meant RETRY_ELAPSED was only used for data retransmission, which
> > uses exponential backoff timeout directly, so syn_linear_timeouts was
> > not relevant.
>
> Ah, okay.
>
> > >
> > > (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> > > seconds.
> >
> > For SYN retries, as we used linear backoff + exponential backoff, and
> > also limited by TCP_MAX_RETRIES, the possible max RTO is far less than
> > 60s. So we didn't clamp it. Do you think we need to clamp it as well?
>
> If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll
> reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or
> just up to ... + 2^6, that is, 119 seconds?
>
> In any case, what is the difference compared to data retransmissions?
You are right. I assumed wrongly that syn_linear_timeouts is always 4.
When it's 0, it's the same as data retransmissions.
>
> Don't we have 3 bits to store the retry count as well, so we're limited
> by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same
> counter.
Yes, it's the same counter. I guess you mean we should clamp it as well.
>
> > > ...the only thing that I don't see implemented in this version of the
> > > patch is paragraph 5.7:
> > >
> > > (5.7) If the timer expires awaiting the ACK of a SYN segment and the
> > > TCP implementation is using an RTO less than 3 seconds, the RTO
> > > MUST be re-initialized to 3 seconds when data transmission
> > > begins (i.e., after the three-way handshake completes).
> > >
> > > I missed that while reviewing earlier versions. I guess we need to use
> > > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
> > > simpler than re-introducing separate starting values (one second and
> > > three seconds).
> >
> > I'm not sure I understand here. If the timer expires, didn't we reset
> > the connection directly? We would never get to the data transmission
> > phase?
>
> In the RFC, "the timer expires" indicates one iteration of the timeout
> algorithm, that is, it's the same as our timer (tcp_timer_handler())
> triggering.
I see.
If you agree, I can add a new patch in this series to address the clamping.
>
> The paragraph says "after the three-way handshake completes", so it
> looks like the connection wasn't reset.
>
> > > > Probably I should name it more clearly.
> > > >
> > > > >
> > > > > But in any case, do you really need to calculate that explicitly? I was
> > > > > thinking that you can just clamp the value when you use it in
> > > > > tcp_timer_ctl().
> > > > >
> > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > > > > you read from the kernel, then I guess it's just:
> > > > >
> > > > > ---
> > > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
> > > >
> > > > After reading the comments in v3 when tcp_rto_max_ms was first
> > > > mentioned again, I realized I got something wrong again. I thought it
> > > > was for the total timeout for all retries, so I need to calculate that
> > > > and decide to reset the connection as in v2.
> > >
> > > I think it actually applies to all the retries.
> > >
> > > > Anyway, you are right. We don't need to do that. Thanks for your patience.
> > > >
> > > > > } ...
> > > > >
> > > > > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > > > > flow_perror(conn, "failed to set timer");
> > > > > ---
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-29 7:32 ` Yumei Huang
@ 2025-10-29 7:39 ` Stefano Brivio
2025-10-29 8:59 ` Yumei Huang
0 siblings, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-29 7:39 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Wed, 29 Oct 2025 15:32:33 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Wed, 29 Oct 2025 13:11:48 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > On Wed, 29 Oct 2025 11:06:44 +0800
> > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > >
> > > > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, 28 Oct 2025 16:09:03 +0800
> > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > >
> > > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > > Use an exponential backoff timeout for data retransmission according
> > > > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > > > > > in Appendix A of RFC 6298.
> > > > > > > > >
> > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > > > ---
> > > > > > > > > tcp.c | 27 ++++++++++++---------------
> > > > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > > > > index 9385132..dc0ec6c 100644
> > > > > > > > > --- a/tcp.c
> > > > > > > > > +++ b/tcp.c
> > > > > > > > > @@ -179,16 +179,14 @@
> > > > > > > > > *
> > > > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > > > > > *
> > > > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > > > > - *
> > > > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > > > > > + * handshake.
> > > > > > > > > *
> > > > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > > > > @@ -342,8 +340,7 @@ enum {
> > > > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > > > > > >
> > > > > > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > > > > > -#define ACK_TIMEOUT 2
> > > > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > > > > > #define FIN_TIMEOUT 60
> > > > > > > > > #define ACT_TIMEOUT 7200
> > > > > > > > >
> > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > > > > > else
> > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > > > > > }
> > > > > > > > > else
> > > > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > > > > > >
> > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > > > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > > > > > >
> > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> > > > > >
> > > > > > Why can't you do that? Say:
> > > > > >
> > > > > > it.it_value.tv_sec = RTO_INIT
> > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > > > > > it.it_value.tv_sec <<= (conn->retries -
> > > > > > c->tcp.syn_linear_timeouts);
> > > > > >
> > > > > > but anyway, see below.
> > > > > >
> > > > > > > But it seems more complicated. What do you think?
> > > > > >
> > > > > > Or maybe, building on my latest comment to 3/4:
> > > > > >
> > > > > > int factor = conn->retries;
> > > > > >
> > > > > > if (!(conn->events & ESTABLISHED))
> > > > > > factor -= c->tcp.syn_linear_timeouts;
> > > > > >
> > > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> > > > > >
> > > > > > ?
> > > > >
> > > > > Yeah, I understand this part now.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > > > > > } else {
> > > > > > > >
> > > > > > > > The rest of the series looks good to me.
> > > > > > > >
> > > > > > > > It might be slightly more practical to factor in directly the RTO
> > > > > > > > clamp, and I don't think it's complicated now that you have the helper
> > > > > > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > > > > > makes sense in any case.
> > > > > > >
> > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > > > > > about the way we get the total time for retries.
> > > > > > >
> > > > > > > I used to do it like this in v2,
> > > > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > > > > > >
> > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > > > > > >
> > > > > > > Though the formula is not quite right, we could refine it as below:
> > > > > > >
> > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > > > > > >
> > > > > > > Does it make sense to get the time this way?
> > > > > >
> > > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
> > > > >
> > > > > Not really, it's only used for data retransmission, so
> > > > > syn_linear_timeouts is not relevant.
> > > >
> > > > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
> > > > stating:
> > >
> > > I meant RETRY_ELAPSED was only used for data retransmission, which
> > > uses exponential backoff timeout directly, so syn_linear_timeouts was
> > > not relevant.
> >
> > Ah, okay.
> >
> > > >
> > > > (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> > > > seconds.
> > >
> > > For SYN retries, as we used linear backoff + exponential backoff, and
> > > also limited by TCP_MAX_RETRIES, the possible max RTO is far less than
> > > 60s. So we didn't clamp it. Do you think we need to clamp it as well?
> >
> > If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll
> > reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or
> > just up to ... + 2^6, that is, 119 seconds?
> >
> > In any case, what is the difference compared to data retransmissions?
>
> You are right. I assumed wrongly that syn_linear_timeouts is always 4.
> When it's 0, it's the same as data retransmissions.
> >
> > Don't we have 3 bits to store the retry count as well, so we're limited
> > by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same
> > counter.
>
> Yes, it's the same counter. I guess you mean we should clamp it as well.
I didn't check this part, I thought we already did, but if we don't, we
should do that, yes.
> > > > ...the only thing that I don't see implemented in this version of the
> > > > patch is paragraph 5.7:
> > > >
> > > > (5.7) If the timer expires awaiting the ACK of a SYN segment and the
> > > > TCP implementation is using an RTO less than 3 seconds, the RTO
> > > > MUST be re-initialized to 3 seconds when data transmission
> > > > begins (i.e., after the three-way handshake completes).
> > > >
> > > > I missed that while reviewing earlier versions. I guess we need to use
> > > > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
> > > > simpler than re-introducing separate starting values (one second and
> > > > three seconds).
> > >
> > > I'm not sure I understand here. If the timer expires, didn't we reset
> > > the connection directly? We would never get to the data transmission
> > > phase?
> >
> > In the RFC, "the timer expires" indicates one iteration of the timeout
> > algorithm, that is, it's the same as our timer (tcp_timer_handler())
> > triggering.
>
> I see.
>
> If you agree, I can add a new patch in this series to address the clamping.
I don't really have a preference, I guess it could be directly in 4/4 or
in a separate patch, neither option should complicate things too much.
> > The paragraph says "after the three-way handshake completes", so it
> > looks like the connection wasn't reset.
> >
> > > > > Probably I should name it more clearly.
> > > > >
> > > > > >
> > > > > > But in any case, do you really need to calculate that explicitly? I was
> > > > > > thinking that you can just clamp the value when you use it in
> > > > > > tcp_timer_ctl().
> > > > > >
> > > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > > > > > you read from the kernel, then I guess it's just:
> > > > > >
> > > > > > ---
> > > > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
> > > > >
> > > > > After reading the comments in v3 when tcp_rto_max_ms was first
> > > > > mentioned again, I realized I got something wrong again. I thought it
> > > > > was for the total timeout for all retries, so I need to calculate that
> > > > > and decide to reset the connection as in v2.
> > > >
> > > > I think it actually applies to all the retries.
> > > >
> > > > > Anyway, you are right. We don't need to do that. Thanks for your patience.
> > > > >
> > > > > > } ...
> > > > > >
> > > > > > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > > > > > flow_perror(conn, "failed to set timer");
> > > > > > ---
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-29 7:39 ` Stefano Brivio
@ 2025-10-29 8:59 ` Yumei Huang
2025-10-29 12:18 ` Stefano Brivio
0 siblings, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-29 8:59 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Wed, 29 Oct 2025 15:32:33 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Wed, 29 Oct 2025 13:11:48 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > >
> > > > > On Wed, 29 Oct 2025 11:06:44 +0800
> > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > >
> > > > > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, 28 Oct 2025 16:09:03 +0800
> > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > >
> > > > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > > Use an exponential backoff timeout for data retransmission according
> > > > > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > > > > > > in Appendix A of RFC 6298.
> > > > > > > > > >
> > > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > > > > ---
> > > > > > > > > > tcp.c | 27 ++++++++++++---------------
> > > > > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > > > > > index 9385132..dc0ec6c 100644
> > > > > > > > > > --- a/tcp.c
> > > > > > > > > > +++ b/tcp.c
> > > > > > > > > > @@ -179,16 +179,14 @@
> > > > > > > > > > *
> > > > > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > > > > > > *
> > > > > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > > > > > - *
> > > > > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > > > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > > > > > > + * handshake.
> > > > > > > > > > *
> > > > > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > > > > > @@ -342,8 +340,7 @@ enum {
> > > > > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > > > > > > >
> > > > > > > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > > > > > > -#define ACK_TIMEOUT 2
> > > > > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > > > > > > #define FIN_TIMEOUT 60
> > > > > > > > > > #define ACT_TIMEOUT 7200
> > > > > > > > > >
> > > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > > > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > > > > > > else
> > > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > > > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > > > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > > > > > > }
> > > > > > > > > > else
> > > > > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > > > > > > >
> > > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > > > > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > > > > > > >
> > > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > > > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> > > > > > >
> > > > > > > Why can't you do that? Say:
> > > > > > >
> > > > > > > it.it_value.tv_sec = RTO_INIT
> > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > > > > > > it.it_value.tv_sec <<= (conn->retries -
> > > > > > > c->tcp.syn_linear_timeouts);
> > > > > > >
> > > > > > > but anyway, see below.
> > > > > > >
> > > > > > > > But it seems more complicated. What do you think?
> > > > > > >
> > > > > > > Or maybe, building on my latest comment to 3/4:
> > > > > > >
> > > > > > > int factor = conn->retries;
> > > > > > >
> > > > > > > if (!(conn->events & ESTABLISHED))
> > > > > > > factor -= c->tcp.syn_linear_timeouts;
> > > > > > >
> > > > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> > > > > > >
> > > > > > > ?
> > > > > >
> > > > > > Yeah, I understand this part now.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > > > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > > > > > > } else {
> > > > > > > > >
> > > > > > > > > The rest of the series looks good to me.
> > > > > > > > >
> > > > > > > > > It might be slightly more practical to factor in directly the RTO
> > > > > > > > > clamp, and I don't think it's complicated now that you have the helper
> > > > > > > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > > > > > > makes sense in any case.
> > > > > > > >
> > > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > > > > > > about the way we get the total time for retries.
> > > > > > > >
> > > > > > > > I used to do it like this in v2,
> > > > > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > > > > > > >
> > > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > > > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > > > > > > >
> > > > > > > > Though the formula is not quite right, we could refine it as below:
> > > > > > > >
> > > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > > > > > > >
> > > > > > > > Does it make sense to get the time this way?
> > > > > > >
> > > > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
> > > > > >
> > > > > > Not really, it's only used for data retransmission, so
> > > > > > syn_linear_timeouts is not relevant.
> > > > >
> > > > > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
> > > > > stating:
> > > >
> > > > I meant RETRY_ELAPSED was only used for data retransmission, which
> > > > uses exponential backoff timeout directly, so syn_linear_timeouts was
> > > > not relevant.
> > >
> > > Ah, okay.
> > >
> > > > >
> > > > > (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> > > > > seconds.
> > > >
> > > > For SYN retries, as we used linear backoff + exponential backoff, and
> > > > also limited by TCP_MAX_RETRIES, the possible max RTO is far less than
> > > > 60s. So we didn't clamp it. Do you think we need to clamp it as well?
> > >
> > > If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll
> > > reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or
> > > just up to ... + 2^6, that is, 119 seconds?
> > >
> > > In any case, what is the difference compared to data retransmissions?
> >
> > You are right. I assumed wrongly that syn_linear_timeouts is always 4.
> > When it's 0, it's the same as data retransmissions.
> > >
> > > Don't we have 3 bits to store the retry count as well, so we're limited
> > > by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same
> > > counter.
> >
> > Yes, it's the same counter. I guess you mean we should clamp it as well.
>
> I didn't check this part, I thought we already did, but if we don't, we
> should do that, yes.
>
> > > > > ...the only thing that I don't see implemented in this version of the
> > > > > patch is paragraph 5.7:
> > > > >
> > > > > (5.7) If the timer expires awaiting the ACK of a SYN segment and the
> > > > > TCP implementation is using an RTO less than 3 seconds, the RTO
> > > > > MUST be re-initialized to 3 seconds when data transmission
> > > > > begins (i.e., after the three-way handshake completes).
> > > > >
> > > > > I missed that while reviewing earlier versions. I guess we need to use
> > > > > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
> > > > > simpler than re-introducing separate starting values (one second and
> > > > > three seconds).
Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if
(c->conn->events & ESTABLISHED)", I guess you mean:
it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
Then the code would be:
it.it_value.tv_sec = RTO_INIT;
if (!(conn->events & ESTABLISHED)) {
int exp = conn->retries - c->tcp.syn_linear_timeouts;
it.it_value.tv_sec <<= MAX(exp, 0);
} else {
it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
it.it_value.tv_sec <<= conn->retries;
}
it.it_value.tv_sec = MIN(
it.it_value.tv_sec, c->tcp.tcp_rto_max);
we basically set the starting value to 3 for data retransmissions no
matter if we have retried SYN. And the max total timeout would be
3+6+12+24+48+96+120+120 = 429s. Is it what we want? Besides, I guess I
need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
> > > >
> > > > I'm not sure I understand here. If the timer expires, didn't we reset
> > > > the connection directly? We would never get to the data transmission
> > > > phase?
> > >
> > > In the RFC, "the timer expires" indicates one iteration of the timeout
> > > algorithm, that is, it's the same as our timer (tcp_timer_handler())
> > > triggering.
> >
> > I see.
> >
> > If you agree, I can add a new patch in this series to address the clamping.
>
> I don't really have a preference, I guess it could be directly in 4/4 or
> in a separate patch, neither option should complicate things too much.
>
> > > The paragraph says "after the three-way handshake completes", so it
> > > looks like the connection wasn't reset.
> > >
> > > > > > Probably I should name it more clearly.
> > > > > >
> > > > > > >
> > > > > > > But in any case, do you really need to calculate that explicitly? I was
> > > > > > > thinking that you can just clamp the value when you use it in
> > > > > > > tcp_timer_ctl().
> > > > > > >
> > > > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > > > > > > you read from the kernel, then I guess it's just:
> > > > > > >
> > > > > > > ---
> > > > > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
> > > > > >
> > > > > > After reading the comments in v3 when tcp_rto_max_ms was first
> > > > > > mentioned again, I realized I got something wrong again. I thought it
> > > > > > was for the total timeout for all retries, so I need to calculate that
> > > > > > and decide to reset the connection as in v2.
> > > > >
> > > > > I think it actually applies to all the retries.
> > > > >
> > > > > > Anyway, you are right. We don't need to do that. Thanks for your patience.
> > > > > >
> > > > > > > } ...
> > > > > > >
> > > > > > > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > > > > > > flow_perror(conn, "failed to set timer");
> > > > > > > ---
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-29 8:59 ` Yumei Huang
@ 2025-10-29 12:18 ` Stefano Brivio
2025-10-30 8:25 ` Yumei Huang
0 siblings, 1 reply; 44+ messages in thread
From: Stefano Brivio @ 2025-10-29 12:18 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Wed, 29 Oct 2025 16:59:51 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Wed, 29 Oct 2025 15:32:33 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > On Wed, 29 Oct 2025 13:11:48 +0800
> > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > >
> > > > > On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, 29 Oct 2025 11:06:44 +0800
> > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > >
> > > > > > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 28 Oct 2025 16:09:03 +0800
> > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Use an exponential backoff timeout for data retransmission according
> > > > > > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > > > > > > > in Appendix A of RFC 6298.
> > > > > > > > > > >
> > > > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > > > > > ---
> > > > > > > > > > > tcp.c | 27 ++++++++++++---------------
> > > > > > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > > > > > > index 9385132..dc0ec6c 100644
> > > > > > > > > > > --- a/tcp.c
> > > > > > > > > > > +++ b/tcp.c
> > > > > > > > > > > @@ -179,16 +179,14 @@
> > > > > > > > > > > *
> > > > > > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > > > > > > > *
> > > > > > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > > > > > > - *
> > > > > > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > > > > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > > > > > > > + * handshake.
> > > > > > > > > > > *
> > > > > > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > > > > > > @@ -342,8 +340,7 @@ enum {
> > > > > > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > > > > > > > >
> > > > > > > > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > > > > > > > -#define ACK_TIMEOUT 2
> > > > > > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > > > > > > > #define FIN_TIMEOUT 60
> > > > > > > > > > > #define ACT_TIMEOUT 7200
> > > > > > > > > > >
> > > > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > > > > > > > else
> > > > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > > > > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > > > > > > > }
> > > > > > > > > > > else
> > > > > > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > > > > > > > >
> > > > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > > > > > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > > > > > > > >
> > > > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > > > > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> > > > > > > >
> > > > > > > > Why can't you do that? Say:
> > > > > > > >
> > > > > > > > it.it_value.tv_sec = RTO_INIT
> > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > > > > > > > it.it_value.tv_sec <<= (conn->retries -
> > > > > > > > c->tcp.syn_linear_timeouts);
> > > > > > > >
> > > > > > > > but anyway, see below.
> > > > > > > >
> > > > > > > > > But it seems more complicated. What do you think?
> > > > > > > >
> > > > > > > > Or maybe, building on my latest comment to 3/4:
> > > > > > > >
> > > > > > > > int factor = conn->retries;
> > > > > > > >
> > > > > > > > if (!(conn->events & ESTABLISHED))
> > > > > > > > factor -= c->tcp.syn_linear_timeouts;
> > > > > > > >
> > > > > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> > > > > > > >
> > > > > > > > ?
> > > > > > >
> > > > > > > Yeah, I understand this part now.
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > > > > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > > > > > > > } else {
> > > > > > > > > >
> > > > > > > > > > The rest of the series looks good to me.
> > > > > > > > > >
> > > > > > > > > > It might be slightly more practical to factor in directly the RTO
> > > > > > > > > > clamp, and I don't think it's complicated now that you have the helper
> > > > > > > > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > > > > > > > makes sense in any case.
> > > > > > > > >
> > > > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > > > > > > > about the way we get the total time for retries.
> > > > > > > > >
> > > > > > > > > I used to do it like this in v2,
> > > > > > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > > > > > > > >
> > > > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > > > > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > > > > > > > >
> > > > > > > > > Though the formula is not quite right, we could refine it as below:
> > > > > > > > >
> > > > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > > > > > > > >
> > > > > > > > > Does it make sense to get the time this way?
> > > > > > > >
> > > > > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
> > > > > > >
> > > > > > > Not really, it's only used for data retransmission, so
> > > > > > > syn_linear_timeouts is not relevant.
> > > > > >
> > > > > > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
> > > > > > stating:
> > > > >
> > > > > I meant RETRY_ELAPSED was only used for data retransmission, which
> > > > > uses exponential backoff timeout directly, so syn_linear_timeouts was
> > > > > not relevant.
> > > >
> > > > Ah, okay.
> > > >
> > > > > >
> > > > > > (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> > > > > > seconds.
> > > > >
> > > > > For SYN retries, as we used linear backoff + exponential backoff, and
> > > > > also limited by TCP_MAX_RETRIES, the possible max RTO is far less than
> > > > > 60s. So we didn't clamp it. Do you think we need to clamp it as well?
> > > >
> > > > If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll
> > > > reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or
> > > > just up to ... + 2^6, that is, 119 seconds?
> > > >
> > > > In any case, what is the difference compared to data retransmissions?
> > >
> > > You are right. I assumed wrongly that syn_linear_timeouts is always 4.
> > > When it's 0, it's the same as data retransmissions.
> > > >
> > > > Don't we have 3 bits to store the retry count as well, so we're limited
> > > > by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same
> > > > counter.
> > >
> > > Yes, it's the same counter. I guess you mean we should clamp it as well.
> >
> > I didn't check this part, I thought we already did, but if we don't, we
> > should do that, yes.
> >
> > > > > > ...the only thing that I don't see implemented in this version of the
> > > > > > patch is paragraph 5.7:
> > > > > >
> > > > > > (5.7) If the timer expires awaiting the ACK of a SYN segment and the
> > > > > > TCP implementation is using an RTO less than 3 seconds, the RTO
> > > > > > MUST be re-initialized to 3 seconds when data transmission
> > > > > > begins (i.e., after the three-way handshake completes).
> > > > > >
> > > > > > I missed that while reviewing earlier versions. I guess we need to use
> > > > > > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
> > > > > > simpler than re-introducing separate starting values (one second and
> > > > > > three seconds).
>
> Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if
> (c->conn->events & ESTABLISHED)", I guess you mean:
>
> it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
Right.
> Then the code would be:
>
> it.it_value.tv_sec = RTO_INIT;
> if (!(conn->events & ESTABLISHED)) {
> int exp = conn->retries - c->tcp.syn_linear_timeouts;
> it.it_value.tv_sec <<= MAX(exp, 0);
> } else {
> it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
> it.it_value.tv_sec <<= conn->retries;
Hmm what's wrong with my previous suggestion of initialising 'exp' with
conn->retries? Then it becomes (equivalent to your code snippet, but
shorter):
> }
> it.it_value.tv_sec = MIN(
> it.it_value.tv_sec, c->tcp.tcp_rto_max);
something like:
int exp = conn->retries, min = 0, max = c->tcp.tcp_rto_max;
if (!(conn->events & ESTABLISHED))
exp -= c->tcp.syn_linear_timeouts;
else
min = /* add a constant for this */ 3;
it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
it.it_value.tv_sec = CLAMP(it.it_value.tv_sec, min, max);
...we don't have a CLAMP() macro at the moment, just MIN() and MAX() in
util.h, but now that we need one, I would add it, similar to the one
from the Linux kernel but without type checking (as it's not really
practical here).
Inspired from include/linux/kernel.h (current Linux kernel tree):
#define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)
we can do something similar with two differences: 1. all our macros are
uppercase (we don't have as many as the kernel, and it's nice to know
that they are macros from the name) and 2. as I mentioned we don't
need/want type checking, so, say:
#define CLAMP(x, min, max) MIN(MAX((x), (min)), (max))
...or keep val / lo / hi if it's clearer, I don't really have a
preference.
I added parentheses around the arguments by the way because I think
it's good practice, even though not needed in this case. It's the
PRE02-C recommendation in CERT C (we generally stick to those
recommendations whenever practical):
https://wiki.sei.cmu.edu/confluence/display/c/PRE02-C.+Macro+replacement+lists+should+be+parenthesized
https://clang.llvm.org/extra/clang-tidy/checks/bugprone/macro-parentheses.html
Note that the Linux kernel has a compatible license (it's the same,
actually, GPLv2+), and, regardless of that, the implementation is
trivial and the idea is obvious, so I don't think we need to give
explicit attribution in this case.
But in other cases we do, or I guess it's fair to the author anyway,
see for example siphash.h.
> we basically set the starting value to 3 for data retransmissions no
> matter if we have retried SYN. And the max total timeout would be
> 3+6+12+24+48+96+120+120 = 429s. Is it what we want?
Ah, yes. In the discussion I previously assumed that the default clamp
value was 60 seconds, but it's actually 120 seconds. It looks correct
to me.
> Besides, I guess I
> need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
Or RTO_INIT_ACK? Maybe RTO_INIT_DATA? I'm thinking that you already
have a RTO_INIT at this point, and this constant is very closely related
to that one.
> > > > >
> > > > > I'm not sure I understand here. If the timer expires, didn't we reset
> > > > > the connection directly? We would never get to the data transmission
> > > > > phase?
> > > >
> > > > In the RFC, "the timer expires" indicates one iteration of the timeout
> > > > algorithm, that is, it's the same as our timer (tcp_timer_handler())
> > > > triggering.
> > >
> > > I see.
> > >
> > > If you agree, I can add a new patch in this series to address the clamping.
> >
> > I don't really have a preference, I guess it could be directly in 4/4 or
> > in a separate patch, neither option should complicate things too much.
> >
> > > > The paragraph says "after the three-way handshake completes", so it
> > > > looks like the connection wasn't reset.
> > > >
> > > > > > > Probably I should name it more clearly.
> > > > > > >
> > > > > > > >
> > > > > > > > But in any case, do you really need to calculate that explicitly? I was
> > > > > > > > thinking that you can just clamp the value when you use it in
> > > > > > > > tcp_timer_ctl().
> > > > > > > >
> > > > > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > > > > > > > you read from the kernel, then I guess it's just:
> > > > > > > >
> > > > > > > > ---
> > > > > > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
> > > > > > >
> > > > > > > After reading the comments in v3 when tcp_rto_max_ms was first
> > > > > > > mentioned again, I realized I got something wrong again. I thought it
> > > > > > > was for the total timeout for all retries, so I need to calculate that
> > > > > > > and decide to reset the connection as in v2.
> > > > > >
> > > > > > I think it actually applies to all the retries.
> > > > > >
> > > > > > > Anyway, you are right. We don't need to do that. Thanks for your patience.
> > > > > > >
> > > > > > > > } ...
> > > > > > > >
> > > > > > > > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > > > > > > > flow_perror(conn, "failed to set timer");
> > > > > > > > ---
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-29 12:18 ` Stefano Brivio
@ 2025-10-30 8:25 ` Yumei Huang
2025-10-30 8:51 ` Stefano Brivio
0 siblings, 1 reply; 44+ messages in thread
From: Yumei Huang @ 2025-10-30 8:25 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, david
On Wed, Oct 29, 2025 at 8:18 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Wed, 29 Oct 2025 16:59:51 +0800
> Yumei Huang <yuhuang@redhat.com> wrote:
>
> > On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Wed, 29 Oct 2025 15:32:33 +0800
> > > Yumei Huang <yuhuang@redhat.com> wrote:
> > >
> > > > On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > >
> > > > > On Wed, 29 Oct 2025 13:11:48 +0800
> > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > >
> > > > > > On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, 29 Oct 2025 11:06:44 +0800
> > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > >
> > > > > > > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, 28 Oct 2025 16:09:03 +0800
> > > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Use an exponential backoff timeout for data retransmission according
> > > > > > > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > > > > > > > > in Appendix A of RFC 6298.
> > > > > > > > > > > >
> > > > > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > > > > > > ---
> > > > > > > > > > > > tcp.c | 27 ++++++++++++---------------
> > > > > > > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > > > > > > > index 9385132..dc0ec6c 100644
> > > > > > > > > > > > --- a/tcp.c
> > > > > > > > > > > > +++ b/tcp.c
> > > > > > > > > > > > @@ -179,16 +179,14 @@
> > > > > > > > > > > > *
> > > > > > > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > > > > > > > > *
> > > > > > > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > > > > > > > - *
> > > > > > > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > > > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > > > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > > > > > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > > > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > > > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > > > > > > > > + * handshake.
> > > > > > > > > > > > *
> > > > > > > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > > > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > > > > > > > @@ -342,8 +340,7 @@ enum {
> > > > > > > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > > > > > > > > >
> > > > > > > > > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > > > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > > > > > > > > -#define ACK_TIMEOUT 2
> > > > > > > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > > > > > > > > #define FIN_TIMEOUT 60
> > > > > > > > > > > > #define ACT_TIMEOUT 7200
> > > > > > > > > > > >
> > > > > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > > > > > > > > else
> > > > > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > > > > > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > > > > > > > > }
> > > > > > > > > > > > else
> > > > > > > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > > > > > > > > >
> > > > > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > > > > > > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > > > > > > > > >
> > > > > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > > > > > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> > > > > > > > >
> > > > > > > > > Why can't you do that? Say:
> > > > > > > > >
> > > > > > > > > it.it_value.tv_sec = RTO_INIT
> > > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > > > > > > > > it.it_value.tv_sec <<= (conn->retries -
> > > > > > > > > c->tcp.syn_linear_timeouts);
> > > > > > > > >
> > > > > > > > > but anyway, see below.
> > > > > > > > >
> > > > > > > > > > But it seems more complicated. What do you think?
> > > > > > > > >
> > > > > > > > > Or maybe, building on my latest comment to 3/4:
> > > > > > > > >
> > > > > > > > > int factor = conn->retries;
> > > > > > > > >
> > > > > > > > > if (!(conn->events & ESTABLISHED))
> > > > > > > > > factor -= c->tcp.syn_linear_timeouts;
> > > > > > > > >
> > > > > > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> > > > > > > > >
> > > > > > > > > ?
> > > > > > > >
> > > > > > > > Yeah, I understand this part now.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > > > > > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > > > > > > > > } else {
> > > > > > > > > > >
> > > > > > > > > > > The rest of the series looks good to me.
> > > > > > > > > > >
> > > > > > > > > > > It might be slightly more practical to factor in directly the RTO
> > > > > > > > > > > clamp, and I don't think it's complicated now that you have the helper
> > > > > > > > > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > > > > > > > > makes sense in any case.
> > > > > > > > > >
> > > > > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > > > > > > > > about the way we get the total time for retries.
> > > > > > > > > >
> > > > > > > > > > I used to do it like this in v2,
> > > > > > > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > > > > > > > > >
> > > > > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > > > > > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > > > > > > > > >
> > > > > > > > > > Though the formula is not quite right, we could refine it as below:
> > > > > > > > > >
> > > > > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > > > > > > > > >
> > > > > > > > > > Does it make sense to get the time this way?
> > > > > > > > >
> > > > > > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
> > > > > > > >
> > > > > > > > Not really, it's only used for data retransmission, so
> > > > > > > > syn_linear_timeouts is not relevant.
> > > > > > >
> > > > > > > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
> > > > > > > stating:
> > > > > >
> > > > > > I meant RETRY_ELAPSED was only used for data retransmission, which
> > > > > > uses exponential backoff timeout directly, so syn_linear_timeouts was
> > > > > > not relevant.
> > > > >
> > > > > Ah, okay.
> > > > >
> > > > > > >
> > > > > > > (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> > > > > > > seconds.
> > > > > >
> > > > > > For SYN retries, as we used linear backoff + exponential backoff, and
> > > > > > also limited by TCP_MAX_RETRIES, the possible max RTO is far less than
> > > > > > 60s. So we didn't clamp it. Do you think we need to clamp it as well?
> > > > >
> > > > > If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll
> > > > > reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or
> > > > > just up to ... + 2^6, that is, 119 seconds?
> > > > >
> > > > > In any case, what is the difference compared to data retransmissions?
> > > >
> > > > You are right. I assumed wrongly that syn_linear_timeouts is always 4.
> > > > When it's 0, it's the same as data retransmissions.
> > > > >
> > > > > Don't we have 3 bits to store the retry count as well, so we're limited
> > > > > by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same
> > > > > counter.
> > > >
> > > > Yes, it's the same counter. I guess you mean we should clamp it as well.
> > >
> > > I didn't check this part, I thought we already did, but if we don't, we
> > > should do that, yes.
> > >
> > > > > > > ...the only thing that I don't see implemented in this version of the
> > > > > > > patch is paragraph 5.7:
> > > > > > >
> > > > > > > (5.7) If the timer expires awaiting the ACK of a SYN segment and the
> > > > > > > TCP implementation is using an RTO less than 3 seconds, the RTO
> > > > > > > MUST be re-initialized to 3 seconds when data transmission
> > > > > > > begins (i.e., after the three-way handshake completes).
> > > > > > >
> > > > > > > I missed that while reviewing earlier versions. I guess we need to use
> > > > > > > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
> > > > > > > simpler than re-introducing separate starting values (one second and
> > > > > > > three seconds).
> >
> > Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if
> > (c->conn->events & ESTABLISHED)", I guess you mean:
> >
> > it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
>
> Right.
>
> > Then the code would be:
> >
> > it.it_value.tv_sec = RTO_INIT;
> > if (!(conn->events & ESTABLISHED)) {
> > int exp = conn->retries - c->tcp.syn_linear_timeouts;
> > it.it_value.tv_sec <<= MAX(exp, 0);
> > } else {
> > it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
> > it.it_value.tv_sec <<= conn->retries;
>
> Hmm what's wrong with my previous suggestion of initialising 'exp' with
> conn->retries? Then it becomes (equivalent to your code snippet, but
> shorter):
Sorry, I read the comments several times but still missed that. And
yes, it's shorter.
>
> > }
> > it.it_value.tv_sec = MIN(
> > it.it_value.tv_sec, c->tcp.tcp_rto_max);
>
> something like:
>
> int exp = conn->retries, min = 0, max = c->tcp.tcp_rto_max;
>
> if (!(conn->events & ESTABLISHED))
> exp -= c->tcp.syn_linear_timeouts;
> else
> min = /* add a constant for this */ 3;
>
> it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
>
> it.it_value.tv_sec = CLAMP(it.it_value.tv_sec, min, max);
But this seems not right. For data retransmission, only the first
timeout will be changed from 1 to 3s, then the timeout becomes 4, 8,
16 ..., 120.
I feel we don't need CLAMP, the code would be:
#define RTO_INIT_ACK 3
...
int exp = conn->retries;
it.it_value.tv_sec = RTO_INIT;
if (!(conn->events & ESTABLISHED))
exp -= c->tcp.syn_linear_timeouts;
else {
it.it_value.tv_sec = MAX(it.it_value.tv_sec,
RTO_INIT_ACK);
}
it.it_value.tv_sec <<= MAX(exp, 0);
it.it_value.tv_sec = MIN(
it.it_value.tv_sec, c->tcp.tcp_rto_max);
Please correct me if I got anything wrong. Thanks.
>
> ...we don't have a CLAMP() macro at the moment, just MIN() and MAX() in
> util.h, but now that we need one, I would add it, similar to the one
> from the Linux kernel but without type checking (as it's not really
> practical here).
>
> Inspired from include/linux/kernel.h (current Linux kernel tree):
>
> #define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)
>
> we can do something similar with two differences: 1. all our macros are
> uppercase (we don't have as many as the kernel, and it's nice to know
> that they are macros from the name) and 2. as I mentioned we don't
> need/want type checking, so, say:
>
> #define CLAMP(x, min, max) MIN(MAX((x), (min)), (max))
>
> ...or keep val / lo / hi if it's clearer, I don't really have a
> preference.
>
> I added parentheses around the arguments by the way because I think
> it's good practice, even though not needed in this case. It's the
> PRE02-C recommendation in CERT C (we generally stick to those
> recommendations whenever practical):
>
> https://wiki.sei.cmu.edu/confluence/display/c/PRE02-C.+Macro+replacement+lists+should+be+parenthesized
>
> https://clang.llvm.org/extra/clang-tidy/checks/bugprone/macro-parentheses.html
>
> Note that the Linux kernel has a compatible license (it's the same,
> actually, GPLv2+), and, regardless of that, the implementation is
> trivial and the idea is obvious, so I don't think we need to give
> explicit attribution in this case.
>
> But in other cases we do, or I guess it's fair to the author anyway,
> see for example siphash.h.
>
> > we basically set the starting value to 3 for data retransmissions no
> > matter if we have retried SYN. And the max total timeout would be
> > 3+6+12+24+48+96+120+120 = 429s. Is it what we want?
>
> Ah, yes. In the discussion I previously assumed that the default clamp
> value was 60 seconds, but it's actually 120 seconds. It looks correct
> to me.
>
> > Besides, I guess I
> > need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
>
> Or RTO_INIT_ACK? Maybe RTO_INIT_DATA? I'm thinking that you already
> have a RTO_INIT at this point, and this constant is very closely related
> to that one.
>
> > > > > >
> > > > > > I'm not sure I understand here. If the timer expires, didn't we reset
> > > > > > the connection directly? We would never get to the data transmission
> > > > > > phase?
> > > > >
> > > > > In the RFC, "the timer expires" indicates one iteration of the timeout
> > > > > algorithm, that is, it's the same as our timer (tcp_timer_handler())
> > > > > triggering.
> > > >
> > > > I see.
> > > >
> > > > If you agree, I can add a new patch in this series to address the clamping.
> > >
> > > I don't really have a preference, I guess it could be directly in 4/4 or
> > > in a separate patch, neither option should complicate things too much.
> > >
> > > > > The paragraph says "after the three-way handshake completes", so it
> > > > > looks like the connection wasn't reset.
> > > > >
> > > > > > > > Probably I should name it more clearly.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > But in any case, do you really need to calculate that explicitly? I was
> > > > > > > > > thinking that you can just clamp the value when you use it in
> > > > > > > > > tcp_timer_ctl().
> > > > > > > > >
> > > > > > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > > > > > > > > you read from the kernel, then I guess it's just:
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
> > > > > > > >
> > > > > > > > After reading the comments in v3 when tcp_rto_max_ms was first
> > > > > > > > mentioned again, I realized I got something wrong again. I thought it
> > > > > > > > was for the total timeout for all retries, so I need to calculate that
> > > > > > > > and decide to reset the connection as in v2.
> > > > > > >
> > > > > > > I think it actually applies to all the retries.
> > > > > > >
> > > > > > > > Anyway, you are right. We don't need to do that. Thanks for your patience.
> > > > > > > >
> > > > > > > > > } ...
> > > > > > > > >
> > > > > > > > > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > > > > > > > > flow_perror(conn, "failed to set timer");
> > > > > > > > > ---
>
> --
> Stefano
>
--
Thanks,
Yumei Huang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v6 4/4] tcp: Update data retransmission timeout
2025-10-30 8:25 ` Yumei Huang
@ 2025-10-30 8:51 ` Stefano Brivio
0 siblings, 0 replies; 44+ messages in thread
From: Stefano Brivio @ 2025-10-30 8:51 UTC (permalink / raw)
To: Yumei Huang; +Cc: passt-dev, david
On Thu, 30 Oct 2025 16:25:53 +0800
Yumei Huang <yuhuang@redhat.com> wrote:
> On Wed, Oct 29, 2025 at 8:18 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Wed, 29 Oct 2025 16:59:51 +0800
> > Yumei Huang <yuhuang@redhat.com> wrote:
> >
> > > On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > On Wed, 29 Oct 2025 15:32:33 +0800
> > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > >
> > > > > On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, 29 Oct 2025 13:11:48 +0800
> > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > >
> > > > > > > On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 29 Oct 2025 11:06:44 +0800
> > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 28 Oct 2025 16:09:03 +0800
> > > > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800
> > > > > > > > > > > > Yumei Huang <yuhuang@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Use an exponential backoff timeout for data retransmission according
> > > > > > > > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
> > > > > > > > > > > > > in Appendix A of RFC 6298.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > > > > > > > > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > tcp.c | 27 ++++++++++++---------------
> > > > > > > > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > > > > > > > > index 9385132..dc0ec6c 100644
> > > > > > > > > > > > > --- a/tcp.c
> > > > > > > > > > > > > +++ b/tcp.c
> > > > > > > > > > > > > @@ -179,16 +179,14 @@
> > > > > > > > > > > > > *
> > > > > > > > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > > > > > > > > > *
> > > > > > > > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > > > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > > > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > > > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > > > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > > > > > > > > - *
> > > > > > > > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > > > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > > > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for
> > > > > > > > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > > > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during
> > > > > > > > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > > > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > > > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the
> > > > > > > > > > > > > + * timeout for the first retry, in seconds. If this persists too many times
> > > > > > > > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established
> > > > > > > > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the
> > > > > > > > > > > > > + * handshake.
> > > > > > > > > > > > > *
> > > > > > > > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > > > > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > > > > > > > > @@ -342,8 +340,7 @@ enum {
> > > > > > > > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> > > > > > > > > > > > >
> > > > > > > > > > > > > #define ACK_INTERVAL 10 /* ms */
> > > > > > > > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */
> > > > > > > > > > > > > -#define ACK_TIMEOUT 2
> > > > > > > > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */
> > > > > > > > > > > > > #define FIN_TIMEOUT 60
> > > > > > > > > > > > > #define ACT_TIMEOUT 7200
> > > > > > > > > > > > >
> > > > > > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > > > > > > > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > > > > > > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts)
> > > > > > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT;
> > > > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT;
> > > > > > > > > > > > > else
> > > > > > > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT <<
> > > > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT <<
> > > > > > > > > > > > > (conn->retries - c->tcp.syn_linear_timeouts);
> > > > > > > > > > > > > }
> > > > > > > > > > > > > else
> > > > > > > > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT;
> > > > > > > > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries;
> > > > > > > > > > > >
> > > > > > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign
> > > > > > > > > > > > RTO_INIT, and multiply as needed in the if / else clauses.
> > > > > > > > > > >
> > > > > > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when
> > > > > > > > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
> > > > > > > > > >
> > > > > > > > > > Why can't you do that? Say:
> > > > > > > > > >
> > > > > > > > > > it.it_value.tv_sec = RTO_INIT
> > > > > > > > > > if (!(conn->events & ESTABLISHED)) {
> > > > > > > > > > if (conn->retries >= c->tcp.syn_linear_timeouts)
> > > > > > > > > > it.it_value.tv_sec <<= (conn->retries -
> > > > > > > > > > c->tcp.syn_linear_timeouts);
> > > > > > > > > >
> > > > > > > > > > but anyway, see below.
> > > > > > > > > >
> > > > > > > > > > > But it seems more complicated. What do you think?
> > > > > > > > > >
> > > > > > > > > > Or maybe, building on my latest comment to 3/4:
> > > > > > > > > >
> > > > > > > > > > int factor = conn->retries;
> > > > > > > > > >
> > > > > > > > > > if (!(conn->events & ESTABLISHED))
> > > > > > > > > > factor -= c->tcp.syn_linear_timeouts;
> > > > > > > > > >
> > > > > > > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
> > > > > > > > > >
> > > > > > > > > > ?
> > > > > > > > >
> > > > > > > > > Yeah, I understand this part now.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
> > > > > > > > > > > > > it.it_value.tv_sec = FIN_TIMEOUT;
> > > > > > > > > > > > > } else {
> > > > > > > > > > > >
> > > > > > > > > > > > The rest of the series looks good to me.
> > > > > > > > > > > >
> > > > > > > > > > > > It might be slightly more practical to factor in directly the RTO
> > > > > > > > > > > > clamp, and I don't think it's complicated now that you have the helper
> > > > > > > > > > > > from 2/4, but it's not a strong preference from my side, as the series
> > > > > > > > > > > > makes sense in any case.
> > > > > > > > > > >
> > > > > > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is
> > > > > > > > > > > about the way we get the total time for retries.
> > > > > > > > > > >
> > > > > > > > > > > I used to do it like this in v2,
> > > > > > > > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.com/:
> > > > > > > > > > >
> > > > > > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \
> > > > > > > > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
> > > > > > > > > > >
> > > > > > > > > > > Though the formula is not quite right, we could refine it as below:
> > > > > > > > > > >
> > > > > > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
> > > > > > > > > > >
> > > > > > > > > > > Does it make sense to get the time this way?
> > > > > > > > > >
> > > > > > > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
> > > > > > > > >
> > > > > > > > > Not really, it's only used for data retransmission, so
> > > > > > > > > syn_linear_timeouts is not relevant.
> > > > > > > >
> > > > > > > > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one
> > > > > > > > stating:
> > > > > > >
> > > > > > > I meant RETRY_ELAPSED was only used for data retransmission, which
> > > > > > > uses exponential backoff timeout directly, so syn_linear_timeouts was
> > > > > > > not relevant.
> > > > > >
> > > > > > Ah, okay.
> > > > > >
> > > > > > > >
> > > > > > > > (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> > > > > > > > seconds.
> > > > > > >
> > > > > > > For SYN retries, as we used linear backoff + exponential backoff, and
> > > > > > > also limited by TCP_MAX_RETRIES, the possible max RTO is far less than
> > > > > > > 60s. So we didn't clamp it. Do you think we need to clamp it as well?
> > > > > >
> > > > > > If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll
> > > > > > reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or
> > > > > > just up to ... + 2^6, that is, 119 seconds?
> > > > > >
> > > > > > In any case, what is the difference compared to data retransmissions?
> > > > >
> > > > > You are right. I assumed wrongly that syn_linear_timeouts is always 4.
> > > > > When it's 0, it's the same as data retransmissions.
> > > > > >
> > > > > > Don't we have 3 bits to store the retry count as well, so we're limited
> > > > > > by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same
> > > > > > counter.
> > > > >
> > > > > Yes, it's the same counter. I guess you mean we should clamp it as well.
> > > >
> > > > I didn't check this part, I thought we already did, but if we don't, we
> > > > should do that, yes.
> > > >
> > > > > > > > ...the only thing that I don't see implemented in this version of the
> > > > > > > > patch is paragraph 5.7:
> > > > > > > >
> > > > > > > > (5.7) If the timer expires awaiting the ACK of a SYN segment and the
> > > > > > > > TCP implementation is using an RTO less than 3 seconds, the RTO
> > > > > > > > MUST be re-initialized to 3 seconds when data transmission
> > > > > > > > begins (i.e., after the three-way handshake completes).
> > > > > > > >
> > > > > > > > I missed that while reviewing earlier versions. I guess we need to use
> > > > > > > > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's
> > > > > > > > simpler than re-introducing separate starting values (one second and
> > > > > > > > three seconds).
> > >
> > > Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if
> > > (c->conn->events & ESTABLISHED)", I guess you mean:
> > >
> > > it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
> >
> > Right.
> >
> > > Then the code would be:
> > >
> > > it.it_value.tv_sec = RTO_INIT;
> > > if (!(conn->events & ESTABLISHED)) {
> > > int exp = conn->retries - c->tcp.syn_linear_timeouts;
> > > it.it_value.tv_sec <<= MAX(exp, 0);
> > > } else {
> > > it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
> > > it.it_value.tv_sec <<= conn->retries;
> >
> > Hmm what's wrong with my previous suggestion of initialising 'exp' with
> > conn->retries? Then it becomes (equivalent to your code snippet, but
> > shorter):
>
> Sorry, I read the comments several times but still missed that. And
> yes, it's shorter.
>
> >
> > > }
> > > it.it_value.tv_sec = MIN(
> > > it.it_value.tv_sec, c->tcp.tcp_rto_max);
> >
> > something like:
> >
> > int exp = conn->retries, min = 0, max = c->tcp.tcp_rto_max;
> >
> > if (!(conn->events & ESTABLISHED))
> > exp -= c->tcp.syn_linear_timeouts;
> > else
> > min = /* add a constant for this */ 3;
> >
> > it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
> >
> > it.it_value.tv_sec = CLAMP(it.it_value.tv_sec, min, max);
>
> But this seems not right. For data retransmission, only the first
> timeout will be changed from 1 to 3s, then the timeout becomes 4, 8,
> 16 ..., 120.
Oops, sorry, you're right.
I proposed that because I had for a moment the thought that
it.it_value.tv_sec would be just multiplied from the current value in
the next iterations, but it's of course not the case.
> I feel we don't need CLAMP, the code would be:
>
> #define RTO_INIT_ACK 3
> ...
> int exp = conn->retries;
> it.it_value.tv_sec = RTO_INIT;
> if (!(conn->events & ESTABLISHED))
> exp -= c->tcp.syn_linear_timeouts;
> else {
> it.it_value.tv_sec = MAX(it.it_value.tv_sec,
> RTO_INIT_ACK);
> }
> it.it_value.tv_sec <<= MAX(exp, 0);
> it.it_value.tv_sec = MIN(
> it.it_value.tv_sec, c->tcp.tcp_rto_max);
>
> Please correct me if I got anything wrong. Thanks.
It looks good to me now.
Maybe try keeping 'min' or 'max' values assigned separately as I did
and see if it helps readability, though.
*Or* use a temporary 'timeout' variable and assign it.it_value.tv_sec
only at the end, so that lines are shorter and more readable. Probably
better than any 'min' or 'max' which force the reader to look at yet
another place.
But conceptually I think it should be like that, yes.
> > ...we don't have a CLAMP() macro at the moment, just MIN() and MAX() in
> > util.h, but now that we need one, I would add it, similar to the one
> > from the Linux kernel but without type checking (as it's not really
> > practical here).
> >
> > Inspired from include/linux/kernel.h (current Linux kernel tree):
> >
> > #define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)
> >
> > we can do something similar with two differences: 1. all our macros are
> > uppercase (we don't have as many as the kernel, and it's nice to know
> > that they are macros from the name) and 2. as I mentioned we don't
> > need/want type checking, so, say:
> >
> > #define CLAMP(x, min, max) MIN(MAX((x), (min)), (max))
> >
> > ...or keep val / lo / hi if it's clearer, I don't really have a
> > preference.
> >
> > I added parentheses around the arguments by the way because I think
> > it's good practice, even though not needed in this case. It's the
> > PRE02-C recommendation in CERT C (we generally stick to those
> > recommendations whenever practical):
> >
> > https://wiki.sei.cmu.edu/confluence/display/c/PRE02-C.+Macro+replacement+lists+should+be+parenthesized
> >
> > https://clang.llvm.org/extra/clang-tidy/checks/bugprone/macro-parentheses.html
> >
> > Note that the Linux kernel has a compatible license (it's the same,
> > actually, GPLv2+), and, regardless of that, the implementation is
> > trivial and the idea is obvious, so I don't think we need to give
> > explicit attribution in this case.
> >
> > But in other cases we do, or I guess it's fair to the author anyway,
> > see for example siphash.h.
> >
> > > we basically set the starting value to 3 for data retransmissions no
> > > matter if we have retried SYN. And the max total timeout would be
> > > 3+6+12+24+48+96+120+120 = 429s. Is it what we want?
> >
> > Ah, yes. In the discussion I previously assumed that the default clamp
> > value was 60 seconds, but it's actually 120 seconds. It looks correct
> > to me.
> >
> > > Besides, I guess I
> > > need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
> >
> > Or RTO_INIT_ACK? Maybe RTO_INIT_DATA? I'm thinking that you already
> > have a RTO_INIT at this point, and this constant is very closely related
> > to that one.
> >
> > > > > > >
> > > > > > > I'm not sure I understand here. If the timer expires, didn't we reset
> > > > > > > the connection directly? We would never get to the data transmission
> > > > > > > phase?
> > > > > >
> > > > > > In the RFC, "the timer expires" indicates one iteration of the timeout
> > > > > > algorithm, that is, it's the same as our timer (tcp_timer_handler())
> > > > > > triggering.
> > > > >
> > > > > I see.
> > > > >
> > > > > If you agree, I can add a new patch in this series to address the clamping.
> > > >
> > > > I don't really have a preference, I guess it could be directly in 4/4 or
> > > > in a separate patch, neither option should complicate things too much.
> > > >
> > > > > > The paragraph says "after the three-way handshake completes", so it
> > > > > > looks like the connection wasn't reset.
> > > > > >
> > > > > > > > > Probably I should name it more clearly.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > But in any case, do you really need to calculate that explicitly? I was
> > > > > > > > > > thinking that you can just clamp the value when you use it in
> > > > > > > > > > tcp_timer_ctl().
> > > > > > > > > >
> > > > > > > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value
> > > > > > > > > > you read from the kernel, then I guess it's just:
> > > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
> > > > > > > > >
> > > > > > > > > After reading the comments in v3 when tcp_rto_max_ms was first
> > > > > > > > > mentioned again, I realized I got something wrong again. I thought it
> > > > > > > > > was for the total timeout for all retries, so I need to calculate that
> > > > > > > > > and decide to reset the connection as in v2.
> > > > > > > >
> > > > > > > > I think it actually applies to all the retries.
> > > > > > > >
> > > > > > > > > Anyway, you are right. We don't need to do that. Thanks for your patience.
> > > > > > > > >
> > > > > > > > > > } ...
> > > > > > > > > >
> > > > > > > > > > if (timerfd_settime(conn->timer, 0, &it, NULL))
> > > > > > > > > > flow_perror(conn, "failed to set timer");
> > > > > > > > > > ---
--
Stefano
^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2025-10-30 8:51 UTC | newest]
Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-17 6:28 [PATCH v6 0/4] Retry SYNs for inbound connections Yumei Huang
2025-10-17 6:28 ` [PATCH v6 1/4] tcp: Rename "retrans" to "retries" Yumei Huang
2025-10-17 6:28 ` [PATCH v6 2/4] util: Introduce read_file() and read_file_integer() function Yumei Huang
2025-10-19 10:07 ` Stefano Brivio
2025-10-21 9:32 ` Yumei Huang
2025-10-21 21:50 ` Stefano Brivio
2025-10-22 0:51 ` David Gibson
2025-10-22 8:42 ` Yumei Huang
2025-10-22 0:55 ` Yumei Huang
2025-10-23 23:04 ` Stefano Brivio
2025-10-24 3:16 ` David Gibson
2025-10-24 6:05 ` Yumei Huang
2025-10-28 7:11 ` Yumei Huang
2025-10-28 11:43 ` Stefano Brivio
2025-10-17 6:28 ` [PATCH v6 3/4] tcp: Resend SYN for inbound connections Yumei Huang
2025-10-22 1:16 ` David Gibson
2025-10-22 1:30 ` Yumei Huang
2025-10-22 2:26 ` David Gibson
2025-10-23 23:04 ` Stefano Brivio
2025-10-24 3:30 ` David Gibson
2025-10-24 8:37 ` Stefano Brivio
2025-10-24 10:55 ` David Gibson
2025-10-27 3:37 ` Yumei Huang
2025-10-27 6:49 ` Stefano Brivio
2025-10-28 7:43 ` Yumei Huang
2025-10-28 11:44 ` Stefano Brivio
2025-10-29 2:31 ` Yumei Huang
2025-10-17 6:28 ` [PATCH v6 4/4] tcp: Update data retransmission timeout Yumei Huang
2025-10-22 1:19 ` David Gibson
2025-10-22 8:40 ` Yumei Huang
2025-10-23 23:04 ` Stefano Brivio
2025-10-28 8:09 ` Yumei Huang
2025-10-28 11:44 ` Stefano Brivio
2025-10-28 11:54 ` Stefano Brivio
2025-10-29 3:06 ` Yumei Huang
2025-10-29 4:38 ` Stefano Brivio
2025-10-29 5:11 ` Yumei Huang
2025-10-29 7:09 ` Stefano Brivio
2025-10-29 7:32 ` Yumei Huang
2025-10-29 7:39 ` Stefano Brivio
2025-10-29 8:59 ` Yumei Huang
2025-10-29 12:18 ` Stefano Brivio
2025-10-30 8:25 ` Yumei Huang
2025-10-30 8:51 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).