From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
Subject: Re: [PATCH v3 04/10] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
Date: Tue, 9 Dec 2025 16:10:36 +1100
Message-ID: <aTevTHB6XWn9qLpc@zatzit>
In-Reply-To: <20251208072024.3884137-5-sbrivio@redhat.com>
On Mon, Dec 08, 2025 at 08:20:17AM +0100, Stefano Brivio wrote:
> A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
> the previous change, because we would generally cause retransmissions
> for non-local outbound transfers with relatively high (> 100 Mbps)
> bandwidth and non-local but low (< 5 ms) RTT.
>
> Now that retransmissions are less frequent, we don't have a proper
> trigger to check for acknowledged bytes on the socket, and will
> generally block the sender for a significant amount of time while
> we could acknowledge more data, instead.
>
> Store the RTT reported by the kernel using an approximation (exponent),
> to keep flow storage size within two (typical) cachelines. Check for
> socket updates when half of this time elapses: it should be a good
> indication of the one-way delay we're interested in (peer to us).
>
> Representable values are between 100 us and 3.2768 s, and any value
> outside this range is clamped to these bounds. This choice appears
> to be a good trade-off between additional overhead and throughput.
>
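For anyone following along, here is a minimal standalone sketch of the exponent storage described above. It is an illustration, not the patch's code: the *_SKETCH constants, ilog2_sketch(), rtt_encode() and rtt_decode() are hypothetical names, and a plain loop stands in for the fls()-based ilog2().

```c
#include <stdint.h>

#define RTT_EXP_BITS_SKETCH	4
#define RTT_EXP_MAX_SKETCH	((1U << RTT_EXP_BITS_SKETCH) - 1)	/* 15 */
#define RTT_MIN_US_SKETCH	100U	/* us, smallest representable RTT */

/* Floor of log2(x) for x > 0, computed by shifting (assumed stand-in) */
static unsigned ilog2_sketch(uint32_t x)
{
	unsigned n = 0;

	while (x >>= 1)
		n++;

	return n;
}

/* Encode an RTT in microseconds as a 4-bit exponent, clamping both ends */
static unsigned rtt_encode(uint32_t rtt_us)
{
	uint32_t units = rtt_us / RTT_MIN_US_SKETCH;
	unsigned e;

	if (units < 1)
		units = 1;	/* anything below 100 us is stored as 100 us */

	e = ilog2_sketch(units);

	return e > RTT_EXP_MAX_SKETCH ? RTT_EXP_MAX_SKETCH : e;
}

/* Decode back to microseconds: 100 us << exponent */
static uint32_t rtt_decode(unsigned e)
{
	return RTT_MIN_US_SKETCH << e;
}
```

With these definitions, the stored value is the RTT rounded down to a power-of-two multiple of 100 us, and the largest one is 100 us << 15 = 3276800 us, that is, the 3.2768 s upper bound quoted above.
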
> This mechanism partially overlaps with the "low RTT" destinations,
> which we use to infer that a socket is connected to an endpoint to
> the same machine (while possibly in a different namespace) if the
> RTT is reported as 10 us or less.
>
> This change doesn't, however, conflict with it: we are reading
> TCP_INFO parameters for local connections anyway, so we can always
> store the RTT approximation opportunistically.
>
> Then, if the RTT is "low", we don't really need a timer to
> acknowledge data as we'll always acknowledge everything to the
> sender right away. However, we have limited space in the array where
> we store addresses of local destination, so the low RTT property of a
> connection might toggle frequently. Because of this, it's actually
> helpful to always have the RTT approximation stored.
>
> This could probably benefit from a future rework, though, introducing
> a more integrated approach between these two mechanisms.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> tcp.c | 30 +++++++++++++++++++++++-------
> tcp_conn.h | 9 +++++++++
> util.c | 14 ++++++++++++++
> util.h | 1 +
> 4 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 28d3304..0167121 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -202,9 +202,13 @@
> * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> * either side, the connection is reset
> *
> - * - ACK_INTERVAL elapsed after data segment received from tap without having
> + * - RTT / 2 elapsed after data segment received from tap without having
> * sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> + *
> + * RTT, here, is an approximation of the RTT value reported by the kernel via
> + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> + * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
> *
> *
> * Summary of data flows (with ESTABLISHED event)
> @@ -341,7 +345,6 @@ enum {
> #define MSS_DEFAULT 536
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> -#define ACK_INTERVAL 10 /* ms */
> #define RTO_INIT 1 /* s, RFC 6298 */
> #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> }
>
> if (conn->flags & ACK_TO_TAP_DUE) {
> - it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> + it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
> + it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> int exp = conn->retries, timeout = RTO_INIT;
> if (!(conn->events & ESTABLISHED))
> @@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> it.it_value.tv_sec = ACT_TIMEOUT;
> }
>
> - flow_dbg(conn, "timer expires in %llu.%03llus",
> - (unsigned long long)it.it_value.tv_sec,
> - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> + if (conn->flags & ACK_TO_TAP_DUE) {
> + flow_trace(conn, "timer expires in %llu.%03llums",
> + (unsigned long)it.it_value.tv_sec * 1000 +
> + (unsigned long long)it.it_value.tv_nsec %
> + (1000 * 1000),
> + (unsigned long long)it.it_value.tv_nsec / 1000);

This is the wrong way around.  The ms part needs to be:
	tv_sec * 1000 + tv_nsec / 1000000
and the fractional (us) part:
	(tv_nsec / 1000) % 1000
(or if you did just want a single digit after the ., then:
	tv_nsec / 100000 % 10
)

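
As a minimal sketch of that split, assuming a non-negative timespec: fmt_ms() below is a hypothetical helper for illustration only, not something in the tree, and it writes into a buffer rather than calling flow_trace().

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: render @ts as "<ms>.<us>ms", three fractional digits */
static int fmt_ms(char *buf, size_t len, const struct timespec *ts)
{
	/* Whole milliseconds: seconds scaled up, plus the ms held in tv_nsec */
	unsigned long long ms = (unsigned long long)ts->tv_sec * 1000 +
				(unsigned long long)ts->tv_nsec / 1000000;
	/* Microseconds left over within the current millisecond (0 to 999) */
	unsigned long long us = (unsigned long long)ts->tv_nsec / 1000 % 1000;

	return snprintf(buf, len, "%llu.%03llums", ms, us);
}
```

For example, tv_sec = 1 and tv_nsec = 638400000 (half of the largest storable RTT) comes out as "1638.400ms", and a 50 us timeout as "0.050ms".
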
> + } else {
> + flow_dbg(conn, "timer expires in %llu.%03llus",
> + (unsigned long long)it.it_value.tv_sec,
> + (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> + }
>
> if (timerfd_settime(conn->timer, 0, &it, NULL))
> flow_perror(conn, "failed to set timer");
> @@ -1144,6 +1156,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> conn_flag(c, conn, ACK_TO_TAP_DUE);
>
> out:
> + /* Opportunistically store RTT approximation on valid TCP_INFO data */
> + if (tinfo)
> + RTT_SET(conn, tinfo->tcpi_rtt);
> +
> return new_wnd_to_tap != prev_wnd_to_tap ||
> conn->seq_ack_to_tap != prev_ack_to_tap;
> }
> diff --git a/tcp_conn.h b/tcp_conn.h
> index e36910c..9c6ff9e 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -49,6 +49,15 @@ struct tcp_tap_conn {
> #define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
> #define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS))
>
> +#define RTT_EXP_BITS 4
> + unsigned int rtt_exp :RTT_EXP_BITS;
> +#define RTT_EXP_MAX MAX_FROM_BITS(RTT_EXP_BITS)
> +#define RTT_STORE_MIN 100 /* us, minimum representable */
> +#define RTT_STORE_MAX ((long)(RTT_STORE_MIN << RTT_EXP_MAX))
> +#define RTT_SET(conn, rtt) \
> + (conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
> +#define RTT_GET(conn) (RTT_STORE_MIN << conn->rtt_exp)
> +
> int sock :FD_REF_BITS;
>
> uint8_t events;
> diff --git a/util.c b/util.c
> index 2232a24..bfeb619 100644
> --- a/util.c
> +++ b/util.c
> @@ -614,6 +614,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
> * fls() - Find last (most significant) bit set in word
> * @x: Word
> *
> + * Note: unlike ffs() and other implementations of fls(), notably the one from
> + * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
> + *
> * Return: position of most significant bit set, starting from 0, -1 if none
> */
> int fls(unsigned long x)
> @@ -629,6 +632,17 @@ int fls(unsigned long x)
> return y;
> }
>
> +/**
> + * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
> + * @x: Argument
> + *
> + * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
> + */
> +int ilog2(unsigned long x)
> +{
> + return fls(x);
> +}
> +
> /**
> * write_file() - Replace contents of file with a string
> * @path: File to write
> diff --git a/util.h b/util.h
> index 744880b..f7a941f 100644
> --- a/util.h
> +++ b/util.h
> @@ -233,6 +233,7 @@ int output_file_open(const char *path, int flags);
> void pidfile_write(int fd, pid_t pid);
> int __daemon(int pidfile_fd, int devnull_fd);
> int fls(unsigned long x);
> +int ilog2(unsigned long x);
> int write_file(const char *path, const char *buf);
> intmax_t read_file_integer(const char *path, intmax_t fallback);
> int write_all_buf(int fd, const void *buf, size_t len);
> --
> 2.43.0
>
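
For reference, the ACK_TO_TAP_DUE branch in tcp_timer_ctl() above converts RTT / 2, expressed in microseconds, into seconds plus nanoseconds. A standalone sketch of the same arithmetic follows; half_rtt_to_timespec() is a hypothetical name used only here.

```c
#include <time.h>

/* Hypothetical helper: fill @ts with half of @rtt_us (microseconds) */
static void half_rtt_to_timespec(unsigned long rtt_us, struct timespec *ts)
{
	unsigned long half_us = rtt_us / 2;

	ts->tv_sec  = half_us / 1000000;		/* whole seconds */
	ts->tv_nsec = (long)(half_us % 1000000) * 1000;	/* remainder, us to ns */
}
```

At the two ends of the representable range, 100 us gives a 50000 ns timeout, and 3276800 us gives 1 s plus 638400000 ns.
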
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson