Date: Mon, 8 Dec 2025 08:25:29 +0100
From: Stefano Brivio
To: David Gibson
Subject: Re: [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive
Message-ID: <20251208082529.6ce78f65@elisabeth>
References: <20251208002229.391162-1-sbrivio@redhat.com> <20251208002229.391162-6-sbrivio@redhat.com>
Organization: Red Hat
CC: passt-dev@passt.top, Max Chernoff
List-Id: Development discussion and patches for passt

On Mon, 8 Dec 2025 16:54:55 +1100
David Gibson wrote:

> On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
> > ...instead of checking if the current sending buffer is less than
> > SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
> > ACK segments: we rely on having enough data at once from the sender
> > to make the buffer grow by means of TCP buffer size tuning
> > implemented in the Linux kernel.
> >
> > This is important if we're trying to maximise throughput, but not
> > desirable for interactive traffic, where we want to be transparent as
> > possible and avoid introducing unnecessary latency.
> >
> > Use the tcpi_delivery_rate field reported by the Linux kernel, if
> > available, to calculate the current bandwidth-delay product: if it's
> > significantly smaller than the available sending buffer, conclude that
> > we're not bandwidth-bound and this is likely to be interactive
> > traffic, so acknowledge data only as it's acknowledged by the peer.
> >
> > Conversely, if the bandwidth-delay product is comparable to the size
> > of the sending buffer (more than 5%), we're probably bandwidth-bound
> > or... bound to be: acknowledge everything in that case.
>
> Ah, nice.  This reasoning is much clearer to me than the previous
> spin.
>
> >
> > Signed-off-by: Stefano Brivio
> > ---
> >  tcp.c | 45 +++++++++++++++++++++++++++++++++------------
> >  1 file changed, 33 insertions(+), 12 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 9bf7b8b..533c8a7 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,9 @@ enum {
> >  #define LOW_RTT_TABLE_SIZE		8
> >  #define LOW_RTT_THRESHOLD		10 /* us */
> >
> > +/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> > +#define SNDBUF_TO_BW_DELAY_INTERACTIVE	/* > */ 20 /* (i.e. < 5% of buffer) */
> > +
> >  #define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
> >
> >  #define CONN_IS_CLOSING(conn)						\
> > @@ -426,11 +429,13 @@ socklen_t tcp_info_size;
> >  	  sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
> >
> >  /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
> > -#define snd_wnd_cap	tcp_info_cap(snd_wnd)
> > +#define snd_wnd_cap		tcp_info_cap(snd_wnd)
> >  /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
> > -#define bytes_acked_cap	tcp_info_cap(bytes_acked)
> > +#define bytes_acked_cap		tcp_info_cap(bytes_acked)
> >  /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
> > -#define min_rtt_cap	tcp_info_cap(min_rtt)
> > +#define min_rtt_cap		tcp_info_cap(min_rtt)
> > +/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
> > +#define delivery_rate_cap	tcp_info_cap(delivery_rate)
> >
> >  /* sendmsg() to socket */
> >  static struct iovec	tcp_iov			[UIO_MAXIOV];
> > @@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	socklen_t sl = sizeof(*tinfo);
> >  	struct tcp_info_linux tinfo_new;
> >  	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> > +	bool ack_everything = true;
> >  	int s = conn->sock;
> >
> >  	/* At this point we could ack all the data we've accepted for forwarding
> > @@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	 * control behaviour.
> >  	 *
> >  	 * For it to be possible and worth it we need:
> > -	 *  - The TCP_INFO Linux extension which gives us the peer acked bytes
> > +	 *  - The TCP_INFO Linux extensions which give us the peer acked bytes
> > +	 *    and the delivery rate (outbound bandwidth at receiver)
> >  	 *  - Not to be told not to (force_seq)
> >  	 *  - Not half-closed in the peer->guest direction
> >  	 *      With no data coming from the peer, we might not get events which
> > @@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	 *      Data goes from socket to socket, with nothing meaningfully "in
> >  	 *      flight".
> >  	 *  - Not a pseudo-local connection (e.g. to a VM on the same host)
> > -	 *  - Large enough send buffer
> > -	 *      In these cases, there's not enough in flight to bother.
> > +	 *      If it is, there's not enough in flight to bother.
> > +	 *  - Sending buffer significantly larger than bandwidth * delay product
> > +	 *      Meaning we're not bandwidth-bound and this is likely to be
> > +	 *      interactive traffic where we want to preserve transparent
> > +	 *      connection behaviour and latency.
>
> Do we actually want the sending buffer size here?  Or the amount of
> buffer that's actually in use (SIOCOUTQ)?  If we had a burst transfer
> followed by interactive traffic, the kernel could still have a large
> send buffer allocated, no?

The kernel shrinks it rather fast, and if it's not fast enough, then it
still looks like bulk traffic. I tried several metrics (including
something based on the data just sent, which approximates SIOCOUTQ),
they are not as good as the current buffer size.

> > +	 *
> > +	 * Otherwise, we probably want to maximise throughput, which needs
> > +	 * sending buffer auto-tuning, triggered in turn by filling up the
> > +	 * outbound socket queue.
> >  	 */
> > -	if (bytes_acked_cap && !force_seq &&
> > +	if (bytes_acked_cap && delivery_rate_cap && !force_seq &&
> >  	    !CONN_IS_CLOSING(conn) &&
> > -	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> > -	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> > +	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) {
> >  		if (!tinfo) {
> >  			tinfo = &tinfo_new;
> >  			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> >  				return 0;
> >  		}
> >
> > +		if ((unsigned)SNDBUF_GET(conn) > (long long)RTT_GET(conn) *
>
> Using RTT_GET seems odd here, since we just got a more up to date and
> precise RTT estimate in tinfo.

Oops, right, fixed.

-- 
Stefano
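
For reference, the heuristic discussed above can be sketched in isolation.
This is only an illustration, not the code from the patch: the quoted
condition is cut off, so the helper names bw_delay_product() and
looks_interactive() are hypothetical, and the sketch assumes the delivery
rate is in bytes per second and the RTT in microseconds, as Linux reports
tcpi_delivery_rate and tcpi_rtt in TCP_INFO. The ratio of 20 is taken from
the SNDBUF_TO_BW_DELAY_INTERACTIVE define in the quoted diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Ratio from the quoted diff: buffer more than 20 times the
 * bandwidth * delay product (product below 5% of buffer) suggests
 * interactive traffic.
 */
#define SNDBUF_TO_BW_DELAY_INTERACTIVE	20

/* Hypothetical helper: bandwidth * delay product in bytes, assuming
 * delivery_rate in bytes per second and rtt_us in microseconds.
 */
static uint64_t bw_delay_product(uint64_t delivery_rate, uint32_t rtt_us)
{
	return delivery_rate * rtt_us / 1000000;
}

/* Hypothetical helper: true if the sending buffer is much larger than
 * what the path can keep in flight, so the connection is probably not
 * bandwidth-bound and should be acknowledged only as the peer
 * acknowledges data.
 */
static bool looks_interactive(uint32_t sndbuf, uint64_t delivery_rate,
			      uint32_t rtt_us)
{
	uint64_t bdp = bw_delay_product(delivery_rate, rtt_us);

	return sndbuf > bdp * SNDBUF_TO_BW_DELAY_INTERACTIVE;
}
```

With a 4 MiB sending buffer, 1 MB/s at 10 ms RTT gives a product of about
10 kB, far below 5% of the buffer, so the sketch treats it as interactive.
125 MB/s at 20 ms RTT gives about 2.5 MB, comparable to the buffer, so
everything would be acknowledged.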