From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
Subject: Re: [PATCH v3 06/10] tcp: Acknowledge everything if it looks like bulk traffic, not interactive
Date: Tue, 9 Dec 2025 16:12:32 +1100
Message-ID: <aTevwGuF2l4gXHew@zatzit>
In-Reply-To: <20251208072024.3884137-7-sbrivio@redhat.com>
On Mon, Dec 08, 2025 at 08:20:19AM +0100, Stefano Brivio wrote:
> ...instead of checking if the current sending buffer is less than
> SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
> ACK segments: we rely on having enough data at once from the sender
> to make the buffer grow by means of TCP buffer size tuning
> implemented in the Linux kernel.
>
> This is important if we're trying to maximise throughput, but not
> desirable for interactive traffic, where we want to be as transparent as
> possible and avoid introducing unnecessary latency.
>
> Use the tcpi_delivery_rate field reported by the Linux kernel, if
> available, to calculate the current bandwidth-delay product: if it's
> significantly smaller than the available sending buffer, conclude that
> we're not bandwidth-bound and this is likely to be interactive
> traffic, so acknowledge data only as it's acknowledged by the peer.
>
> Conversely, if the bandwidth-delay product is comparable to the size
> of the sending buffer (more than 5%), we're probably bandwidth-bound
> or... bound to be: acknowledge everything in that case.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 45 +++++++++++++++++++++++++++++++++------------
> 1 file changed, 33 insertions(+), 12 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index b2e4174..923c1f2 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -353,6 +353,9 @@ enum {
> #define LOW_RTT_TABLE_SIZE 8
> #define LOW_RTT_THRESHOLD 10 /* us */
>
> +/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> +#define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
> +
> #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
>
> #define CONN_IS_CLOSING(conn) \
> @@ -426,11 +429,13 @@ socklen_t tcp_info_size;
> sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
>
> /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
> -#define snd_wnd_cap tcp_info_cap(snd_wnd)
> +#define snd_wnd_cap tcp_info_cap(snd_wnd)
> /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
> -#define bytes_acked_cap tcp_info_cap(bytes_acked)
> +#define bytes_acked_cap tcp_info_cap(bytes_acked)
> /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
> -#define min_rtt_cap tcp_info_cap(min_rtt)
> +#define min_rtt_cap tcp_info_cap(min_rtt)
> +/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
> +#define delivery_rate_cap tcp_info_cap(delivery_rate)
>
> /* sendmsg() to socket */
> static struct iovec tcp_iov [UIO_MAXIOV];
> @@ -1050,6 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> socklen_t sl = sizeof(*tinfo);
> struct tcp_info_linux tinfo_new;
> uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> + bool ack_everything = true;
> int s = conn->sock;
>
> /* At this point we could ack all the data we've accepted for forwarding
> @@ -1059,7 +1065,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> * control behaviour.
> *
> * For it to be possible and worth it we need:
> - * - The TCP_INFO Linux extension which gives us the peer acked bytes
> + * - The TCP_INFO Linux extensions which give us the peer acked bytes
> + * and the delivery rate (outbound bandwidth at receiver)
> * - Not to be told not to (force_seq)
> * - Not half-closed in the peer->guest direction
> * With no data coming from the peer, we might not get events which
> @@ -1069,19 +1076,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> * Data goes from socket to socket, with nothing meaningfully "in
> * flight".
> * - Not a pseudo-local connection (e.g. to a VM on the same host)
> - * - Large enough send buffer
> - * In these cases, there's not enough in flight to bother.
> + * If it is, there's not enough in flight to bother.
> + * - Sending buffer significantly larger than bandwidth * delay product
> + * Meaning we're not bandwidth-bound and this is likely to be
> + * interactive traffic where we want to preserve transparent
> + * connection behaviour and latency.
> + *
> + * Otherwise, we probably want to maximise throughput, which needs
> + * sending buffer auto-tuning, triggered in turn by filling up the
> + * outbound socket queue.
> */
> - if (bytes_acked_cap && !force_seq &&
> + if (bytes_acked_cap && delivery_rate_cap && !force_seq &&
> !CONN_IS_CLOSING(conn) &&
> - !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> - (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> + !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) {
> if (!tinfo) {
> tinfo = &tinfo_new;
> if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> return 0;
> }
>
> + if ((unsigned)SNDBUF_GET(conn) > (long long)tinfo->tcpi_rtt *
> + tinfo->tcpi_delivery_rate /
> + 1000 / 1000 *
> + SNDBUF_TO_BW_DELAY_INTERACTIVE)
> + ack_everything = false;
> + }
> +
> + if (ack_everything) {
> + /* Fall back to acknowledging everything we got */
> + conn->seq_ack_to_tap = conn->seq_from_tap;
> + } else {
> /* This trips a cppcheck bug in some versions, including
> * cppcheck 2.18.3.
> * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> @@ -1089,9 +1113,6 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> /* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> conn->seq_init_from_tap;
> - } else {
> - /* Fall back to acknowledging everything we got */
> - conn->seq_ack_to_tap = conn->seq_from_tap;
> }
>
> /* It's occasionally possible for us to go from using the fallback above
> --
> 2.43.0
>
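For readers following the arithmetic in the new check: the sketch below
restates it as standalone C, assuming tcpi_rtt is in microseconds and
tcpi_delivery_rate is in bytes per second, as the Linux kernel reports them.
The helper names bdp_bytes() and looks_interactive() and the macro
BUF_TO_BDP_INTERACTIVE are hypothetical and not part of the patch; the
factor 20 mirrors SNDBUF_TO_BW_DELAY_INTERACTIVE. This is an illustration
only, not the passt implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match SNDBUF_TO_BW_DELAY_INTERACTIVE from the patch:
 * a buffer more than 20 times the bandwidth-delay product (that is, a
 * product below 5% of the buffer) is taken to mean interactive traffic.
 */
#define BUF_TO_BDP_INTERACTIVE	20

/* Hypothetical helper: bandwidth-delay product in bytes.
 * rtt_us is a round-trip time in microseconds, rate is a delivery rate
 * in bytes per second. Dividing by 1000 twice converts microseconds to
 * seconds. Overflow for extreme inputs is not handled in this sketch.
 */
static uint64_t bdp_bytes(uint32_t rtt_us, uint64_t rate)
{
	return (uint64_t)rtt_us * rate / 1000 / 1000;
}

/* Hypothetical helper: true if the sending buffer is large compared to
 * the bandwidth-delay product, so we are probably not bandwidth-bound.
 */
static bool looks_interactive(uint32_t sndbuf, uint32_t rtt_us, uint64_t rate)
{
	return (uint64_t)sndbuf > bdp_bytes(rtt_us, rate) * BUF_TO_BDP_INTERACTIVE;
}
```

With a 1 MiB sending buffer and a 20 ms RTT, a delivery rate of 1 MB/s
gives a product of 20000 bytes: 20 times that is well below the buffer
size, so the connection would be treated as interactive. At 12.5 MB/s the
product is 250000 bytes, 20 times that exceeds the buffer, and everything
would be acknowledged.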
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson