From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
Subject: Re: [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
Date: Mon, 8 Dec 2025 08:22:12 +0100
Message-ID: <20251208082212.5d2abb50@elisabeth>
In-Reply-To: <aTZlAbzwPAtSai8k@zatzit>

On Mon, 8 Dec 2025 16:41:21 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
> > A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
> > the previous change, because we would generally cause retransmissions
> > for non-local outbound transfers with relatively high (> 100 Mbps)
> > bandwidth and non-local but low (< 5 ms) RTT.
> >
> > Now that retransmissions are less frequent, we don't have a proper
> > trigger to check for acknowledged bytes on the socket, and will
> > generally block the sender for a significant amount of time while
> > we could acknowledge more data, instead.
> >
> > Store the RTT reported by the kernel using an approximation (exponent),
> > to keep flow storage size within two (typical) cachelines. Check for
> > socket updates when half of this time elapses: it should be a good
> > indication of the one-way delay we're interested in (peer to us).
> >
> > Representable values are between 100 us and 3.2768 s, and any value
> > outside this range is clamped to these bounds. This choice appears
> > to be a good trade-off between additional overhead and throughput.
> >
> > This mechanism partially overlaps with the "low RTT" destinations,
> > which we use to infer that a socket is connected to an endpoint on
> > the same machine (while possibly in a different namespace) if the
> > RTT is reported as 10 us or less.
> >
> > This change doesn't, however, conflict with it: we are reading
> > TCP_INFO parameters for local connections anyway, so we can always
> > store the RTT approximation opportunistically.
> >
> > Then, if the RTT is "low", we don't really need a timer to
> > acknowledge data as we'll always acknowledge everything to the
> > sender right away. However, we have limited space in the array where
> > we store addresses of local destinations, so the low RTT property of a
> > connection might toggle frequently. Because of this, it's actually
> > helpful to always have the RTT approximation stored.
> >
> > This could probably benefit from a future rework, though, introducing
> > a more integrated approach between these two mechanisms.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > tcp.c | 28 +++++++++++++++++++++-------
> > tcp_conn.h | 9 +++++++++
> > util.c | 14 ++++++++++++++
> > util.h | 1 +
> > 4 files changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 951f434..8eeef4c 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -202,9 +202,13 @@
> > * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> > * either side, the connection is reset
> > *
> > - * - ACK_INTERVAL elapsed after data segment received from tap without having
> > + * - RTT / 2 elapsed after data segment received from tap without having
> > * sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> > - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> > + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> > + *
> > + * RTT, here, is an approximation of the RTT value reported by the kernel via
> > + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> > + * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
> > *
> > *
> > * Summary of data flows (with ESTABLISHED event)
> > @@ -341,7 +345,6 @@ enum {
> > #define MSS_DEFAULT 536
> > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> >
> > -#define ACK_INTERVAL 10 /* ms */
> > #define RTO_INIT 1 /* s, RFC 6298 */
> > #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
> > #define FIN_TIMEOUT 60
> > @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > }
> >
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > - it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > + it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
> > + it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > int exp = conn->retries, timeout = RTO_INIT;
> > if (!(conn->events & ESTABLISHED))
> > @@ -608,9 +612,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > it.it_value.tv_sec = ACT_TIMEOUT;
> > }
> >
> > - flow_dbg(conn, "timer expires in %llu.%03llus",
> > - (unsigned long long)it.it_value.tv_sec,
> > - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > + if (conn->flags & ACK_TO_TAP_DUE) {
> > + flow_trace(conn, "timer expires in %lu.%01llums",
> > + (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> > + (unsigned long long)it.it_value.tv_nsec / 1000);
>
> This doesn't look right - you need a % to exclude the whole
> milliseconds here for the fractional part.
Ah, oops, right. On top of that, the timeout can exceed one second, but
I forgot to include the seconds (tv_sec) in the output. Fixed in v3.
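
To illustrate why the timeout can exceed one second: half of the largest
representable RTT (3276.8 ms) is 1638.4 ms. The sketch below is not the v3
code. It assumes RTT_GET() yields microseconds, as the divisions in the
quoted hunk suggest, and the names RTT_STORE_MIN_US, RTT_STORE_MAX_US and
half_rtt_to_timespec() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical names for the bounds quoted in the patch comment */
#define RTT_STORE_MIN_US	100U		/* 100 us */
#define RTT_STORE_MAX_US	3276800U	/* 3276.8 ms */

/* Split half of an RTT, given in microseconds, into a struct timespec */
static struct timespec half_rtt_to_timespec(uint32_t rtt_us)
{
	uint32_t half_us = rtt_us / 2;
	struct timespec ts;

	/* Whole seconds first, then the remaining microseconds as ns */
	ts.tv_sec = half_us / (1000 * 1000);
	ts.tv_nsec = (long)(half_us % (1000 * 1000)) * 1000;

	/* The remainder is always below one second */
	assert(ts.tv_nsec < 1000L * 1000 * 1000);

	return ts;
}
```

With RTT_STORE_MAX_US as input, tv_sec is 1 and tv_nsec is 638400000, so
a message that prints only tv_nsec would drop a full second.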
> Plus, it looks like this
> is trying to compute microseconds, which would be 3 digits after the
> . in ms, but the format string accommodates only one.
That was intended: I wanted to show only the first fractional digit,
given that the smallest values are hundreds of microseconds, but I
changed it anyway given the possible confusion.
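
For reference, a minimal sketch of one way to print such a timeout as
milliseconds with a three-digit microsecond fraction. This is only an
illustration, not necessarily what v3 does, and the helper names
ts_whole_ms(), ts_frac_us() and ts_format_ms() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Whole milliseconds, including the contribution of tv_sec */
static unsigned long long ts_whole_ms(const struct timespec *ts)
{
	return (unsigned long long)ts->tv_sec * 1000 +
	       ts->tv_nsec / (1000 * 1000);
}

/* Microseconds left over once whole milliseconds are excluded (0..999) */
static unsigned ts_frac_us(const struct timespec *ts)
{
	return (unsigned)(ts->tv_nsec / 1000 % 1000);
}

/* Format as "<ms>.<us>ms", returning what snprintf() returns */
static int ts_format_ms(char *buf, size_t len, const struct timespec *ts)
{
	assert(buf && ts);

	return snprintf(buf, len, "%llu.%03ums",
			ts_whole_ms(ts), ts_frac_us(ts));
}
```

The modulo keeps the fractional part within three digits, and tv_sec is
folded into the whole milliseconds so that values above one second are
reported correctly.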
--
Stefano