Date: Mon, 8 Dec 2025 01:20:59 +0100
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson
Cc: passt-dev@passt.top, Max Chernoff
Subject: Re: [PATCH 6/8] tcp: Allow exceeding the available sending buffer
 size in window advertisements
Message-ID: <20251208012059.36459e27@elisabeth>
References: <20251204074542.2156548-1-sbrivio@redhat.com>
 <20251204074542.2156548-7-sbrivio@redhat.com>
Organization: Red Hat

On Fri, 5 Dec 2025 13:34:07 +1100
David Gibson wrote:

> On Thu, Dec 04, 2025 at 08:45:39AM +0100, Stefano Brivio wrote:
> > ...under two conditions:
> >
> > - the remote peer is advertising a bigger value to us, meaning that a
> >   bigger sending buffer is likely to benefit throughput, AND
>
> I think this condition is redundant: if the remote peer is advertising
> less, we'll clamp new_wnd_to_tap to that value anyway.

I almost fell for this.
We have a subtractive term in the expression, so it's not actually the
case. If the remote peer is advertising a smaller window, we just take
the buffer size *minus pending bytes* as limit, which can be smaller
than the window advertised by the peer.

If it's advertising a bigger window, we take an increased buffer size
minus pending bytes as limit, which can be bigger than the peer's
window, so we'll use the peer's window as limit instead. I added an
example in v2 (now 7/9).

> > - this is not a short-lived connection, where the latency cost of
> >   retransmissions would be otherwise unacceptable.
> >
> > By doing this, we can reliably trigger TCP buffer size auto-tuning (as
> > long as it's available) on bulk data transfers.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 2220059..454df69 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,13 @@ enum {
> >  #define LOW_RTT_TABLE_SIZE	8
> >  #define LOW_RTT_THRESHOLD	10	/* us */
> >
> > +/* Try to avoid retransmissions to improve latency on short-lived connections */
> > +#define SHORT_CONN_BYTES	(16ULL * 1024 * 1024)
> > +
> > +/* Temporarily exceed available sending buffer to force TCP auto-tuning */
> > +#define SNDBUF_BOOST_FACTOR	150	/* % */
> > +#define SNDBUF_BOOST(x)		((x) * SNDBUF_BOOST_FACTOR / 100)
>
> For the short term, the fact this works empirically is enough. For
> the longer term, it would be nice to have a better understanding of
> what this "overcommit" amount is actually estimating.
>
> I think what we're looking for is an estimate of the number of bytes
> that will have left the buffer by the time the guest gets back to us.
> So:

I don't think we want the bandwidth-delay product here (which I'm now
using earlier in the series) because the purpose here is to grow the
buffer at the beginning of a connection, if it looks like bulk traffic.
So we want to progressively exploit auto-tuning as long as we're
limited by a small buffer, but not later: at some point we want to
finally switch to the window advertised by the peer.

Well, I tried with the bandwidth-delay product in any case, but it's
not really helping with auto-tuning. It turns out that auto-tuning is
fundamentally different at the beginning anyway.

> Alas, I don't see a way to estimate either of those from the
> information we already track - we'd need additional bookkeeping.

It's all in struct tcp_info: it's called tcpi_delivery_rate. There are
other interesting bits there, by the way, that could be used in a
further refinement.

> >  #define ACK_IF_NEEDED	0	/* See tcp_send_flag() */
> >
> >  #define CONN_IS_CLOSING(conn)					\
> > @@ -1137,6 +1144,9 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >
> >  	if ((int)sendq > SNDBUF_GET(conn))	/* Due to memory pressure? */
> >  		limit = 0;
> > +	else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn) &&
> > +		 tinfo->tcpi_bytes_acked > SHORT_CONN_BYTES)
>
> This is pretty subtle, I think it would be worth having some rationale
> in a comment, not just the commit message.

I turned the macro into a new function and added comments there, in v2.

> > +		limit = SNDBUF_BOOST(SNDBUF_GET(conn)) - (int)sendq;
> > +	else
> > +		limit = SNDBUF_GET(conn) - (int)sendq;
> >
> > --
> > 2.43.0

-- 
Stefano