public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Cc: Max Chernoff <git@maxchernoff.ca>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: [PATCH v3 08/10] tcp: Allow exceeding the available sending buffer size in window advertisements
Date: Mon,  8 Dec 2025 08:20:21 +0100
Message-ID: <20251208072024.3884137-9-sbrivio@redhat.com>
In-Reply-To: <20251208072024.3884137-1-sbrivio@redhat.com>

If the remote peer is advertising a window bigger than our current
sending buffer, a bigger sending buffer is likely to benefit
throughput.

We can get a bigger sending buffer by means of the buffer size
auto-tuning performed by the Linux kernel, which is triggered by
aggressively filling the sending buffer.

Use an adaptive boost factor, up to 150%, depending on:

- how much data we sent so far: we don't want to risk retransmissions
  for short-lived connections, as the latency cost would be
  unacceptable, and

- the current RTT value, as we need a bigger buffer for higher
  transmission delays

The factor we use is not quite a bandwidth-delay product, as we're
missing the time component of the bandwidth, which is not interesting
here: we are trying to make the buffer grow at the beginning of a
connection, progressively, as more data is sent.
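
As a rough illustration of the adaptive factor described above, here is
a minimal sketch, not the code from this series: the helper names
bytes_rtt_product_sketch() and scale_sketch() are hypothetical, and the
linear interpolation between the two thresholds is an assumption about
how such a factor can be applied. The threshold and factor values are
the ones defined in the patch below.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the patch below */
#define BOOST_LO	2500	/* B * s: no boost up to here */
#define BOOST_HI	6000	/* B * s: full boost from here */
#define BOOST_FACTOR	150	/* % */

/* Hypothetical helper: acknowledged bytes times RTT (given in microseconds),
 * expressed in B * s
 */
static uint64_t bytes_rtt_product_sketch(uint64_t bytes_acked, uint32_t rtt_us)
{
	return bytes_acked * rtt_us / 1000 / 1000;
}

/* Hypothetical helper: scale x from 100% to f% as y moves from lo to hi,
 * assuming linear interpolation in between
 */
static uint64_t scale_sketch(uint64_t x, uint64_t y,
			     uint64_t lo, uint64_t hi, uint64_t f)
{
	uint64_t percent;

	assert(hi > lo && f >= 100);

	if (y <= lo)
		return x;		/* Too little data sent: no boost */
	if (y >= hi)
		return x * f / 100;	/* Full boost */

	percent = 100 + (y - lo) * (f - 100) / (hi - lo);
	return x * percent / 100;
}
```

For instance, with 250 kB acknowledged at 10 ms RTT the product sits at
the lower threshold and the buffer size is used as it is, while with
20 kB acknowledged at 300 ms RTT the product reaches the upper threshold
and the full 150% applies.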

The boost factor was tuned somewhat empirically, but it appears to
reach the available throughput in rather different scenarios (from
~ 10 Gbps bandwidth with 500 ns RTT to ~ 1 Gbps with 300 ms RTT), and
it gets there rather quickly: within a few seconds for the 300 ms case.

Note that we want to apply this factor only if the window advertised
by the peer is bigger than the current sending buffer, as we only need
this for auto-tuning, and we absolutely don't want to incur
unnecessary retransmissions otherwise.

The related condition in tcp_update_seqack_wnd() is not redundant, as
there's a subtractive factor, sendq, in the calculation of the window
limit. If the sending buffer is smaller than the peer's advertised
window, the additional limit we apply might be lower than the one we
would apply otherwise.

Assuming that the sending buffer is reported as 100k, sendq is
20k, we could have these example cases:

1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
   buffer, so we boost its size to 150k, and we limit the window
   to 120k

2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
   buffer, so we aren't trying to trigger buffer auto-tuning and
   we'll stick to the existing, more conservative calculation,
   by limiting the window to 100k - 20k = 80k

If we omitted the new condition, we would always use the boosted
value, that is, 120k, even if potentially causing unnecessary
retransmissions.
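
The two example cases can be reproduced with a small sketch of the limit
selection. This is a simplified illustration rather than the actual
function: wnd_limit_sketch() is a hypothetical name, the boosted buffer
size is passed in as a plain argument, and the final clamp to the peer's
advertised window is assumed, as it is not part of the hunks below.

```c
#include <stdint.h>

/* Hypothetical sketch: pick the window limit from the sending buffer size,
 * its boosted value and the pending queue, then clamp to the peer's window
 * (the clamp is an assumption, it is not shown in this patch).
 * Assumes boosted >= sndbuf.
 */
static uint32_t wnd_limit_sketch(uint32_t snd_wnd, uint32_t sndbuf,
				 uint32_t boosted, uint32_t sendq)
{
	uint32_t limit;

	if (sendq > sndbuf)		/* Due to memory pressure? */
		limit = 0;
	else if (snd_wnd > sndbuf)	/* Peer allows more: try to auto-tune */
		limit = boosted - sendq;
	else				/* Conservative calculation */
		limit = sndbuf - sendq;

	return snd_wnd < limit ? snd_wnd : limit;
}
```

With a 100k buffer boosted to 150k and a sendq of 20k, a peer window of
120k yields a limit of 130k and a resulting window of 120k (case 1),
while a peer window of 90k yields 80k (case 2).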

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/tcp.c b/tcp.c
index 6218f7c..07ef5f0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,13 @@ enum {
 #define LOW_RTT_TABLE_SIZE		8
 #define LOW_RTT_THRESHOLD		10 /* us */
 
+/* Parameters to temporarily exceed sending buffer to force TCP auto-tuning */
+#define SNDBUF_BOOST_BYTES_RTT_LO	2500 /* B * s: no boost until here */
+/* ...examples:  5 MB sent * 500 ns RTT, 250 kB * 10 ms,  8 kB * 300 ms */
+#define SNDBUF_BOOST_FACTOR		150 /* % */
+#define SNDBUF_BOOST_BYTES_RTT_HI	6000 /* apply full boost factor */
+/*		12 MB sent * 500 ns RTT, 600 kB * 10 ms, 20 kB * 300 ms */
+
 /* Ratio of buffer to bandwidth * delay product implying interactive traffic */
 #define SNDBUF_TO_BW_DELAY_INTERACTIVE	/* > */ 20 /* (i.e. < 5% of buffer) */
 
@@ -1035,6 +1042,35 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 	tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
 }
 
+/**
+ * tcp_sndbuf_boost() - Calculate limit of sending buffer to force auto-tuning
+ * @conn:	Connection pointer
+ * @tinfo:	tcp_info from kernel, must be pre-fetched
+ *
+ * Return: increased sending buffer to use as a limit for advertised window
+ */
+static unsigned long tcp_sndbuf_boost(struct tcp_tap_conn *conn,
+				      struct tcp_info_linux *tinfo)
+{
+	unsigned long bytes_rtt_product;
+
+	if (!bytes_acked_cap)
+		return SNDBUF_GET(conn);
+
+	/* This is *not* a bandwidth-delay product, but it's somewhat related:
+	 * as we send more data (usually at the beginning of a connection), we
+	 * try to make the sending buffer progressively grow, with the RTT as a
+	 * factor (longer delay, bigger buffer needed).
+	 */
+	bytes_rtt_product = (long long)tinfo->tcpi_bytes_acked *
+			    tinfo->tcpi_rtt / 1000 / 1000;
+
+	return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
+				  SNDBUF_BOOST_BYTES_RTT_LO,
+				  SNDBUF_BOOST_BYTES_RTT_HI,
+				  SNDBUF_BOOST_FACTOR);
+}
+
 /**
  * tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
  * @c:		Execution context
@@ -1154,6 +1190,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 
 		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
 			limit = 0;
+		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn))
+			limit = tcp_sndbuf_boost(conn, tinfo) - (int)sendq;
 		else
 			limit = SNDBUF_GET(conn) - (int)sendq;
 
-- 
2.43.0

