Date: Mon, 8 Dec 2025 08:22:12 +0100
From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top, Max Chernoff
Subject: Re: [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
Message-ID: <20251208082212.5d2abb50@elisabeth>
References: <20251208002229.391162-1-sbrivio@redhat.com> <20251208002229.391162-4-sbrivio@redhat.com>

On Mon, 8 Dec 2025 16:41:21 +1100
David Gibson wrote:

> On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
> > A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
> > the previous change, because we would generally cause retransmissions
> > for non-local outbound transfers with relatively high (> 100 Mbps)
> > bandwidth and non-local but low (< 5 ms) RTT.
> >
> > Now that retransmissions are less frequent, we don't have a proper
> > trigger to check for acknowledged bytes on the socket, and will
> > generally block the sender for a significant amount of time while
> > we could acknowledge more data, instead.
> >
> > Store the RTT reported by the kernel using an approximation (exponent),
> > to keep flow storage size within two (typical) cachelines. Check for
> > socket updates when half of this time elapses: it should be a good
> > indication of the one-way delay we're interested in (peer to us).
> >
> > Representable values are between 100 us and 3.2768 s, and any value
> > outside this range is clamped to these bounds. This choice appears
> > to be a good trade-off between additional overhead and throughput.
> >
> > This mechanism partially overlaps with the "low RTT" destinations,
> > which we use to infer that a socket is connected to an endpoint on
> > the same machine (while possibly in a different namespace) if the
> > RTT is reported as 10 us or less.
> >
> > This change doesn't, however, conflict with it: we are reading
> > TCP_INFO parameters for local connections anyway, so we can always
> > store the RTT approximation opportunistically.
> >
> > Then, if the RTT is "low", we don't really need a timer to
> > acknowledge data as we'll always acknowledge everything to the
> > sender right away. However, we have limited space in the array where
> > we store addresses of local destinations, so the low RTT property of a
> > connection might toggle frequently. Because of this, it's actually
> > helpful to always have the RTT approximation stored.
> >
> > This could probably benefit from a future rework, though, introducing
> > a more integrated approach between these two mechanisms.
> >
> > Signed-off-by: Stefano Brivio
> > ---
> >  tcp.c      | 28 +++++++++++++++++++++-------
> >  tcp_conn.h |  9 +++++++++
> >  util.c     | 14 ++++++++++++++
> >  util.h     |  1 +
> >  4 files changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 951f434..8eeef4c 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -202,9 +202,13 @@
> >   * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> >   *   either side, the connection is reset
> >   *
> > - * - ACK_INTERVAL elapsed after data segment received from tap without having
> > + * - RTT / 2 elapsed after data segment received from tap without having
> >   *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> > - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> > + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> > + *
> > + *   RTT, here, is an approximation of the RTT value reported by the kernel via
> > + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> > + *   RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
> >   *
> >   *
> >   * Summary of data flows (with ESTABLISHED event)
> > @@ -341,7 +345,6 @@ enum {
> >  #define MSS_DEFAULT			536
> >  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> >
> > -#define ACK_INTERVAL			10		/* ms */
> >  #define RTO_INIT			1		/* s, RFC 6298 */
> >  #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
> >  #define FIN_TIMEOUT			60
> > @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  	}
> >
> >  	if (conn->flags & ACK_TO_TAP_DUE) {
> > -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > +		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
> > +		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
> >  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
> >  		int exp = conn->retries, timeout = RTO_INIT;
> >  		if (!(conn->events & ESTABLISHED))
> > @@ -608,9 +612,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  		it.it_value.tv_sec = ACT_TIMEOUT;
> >  	}
> >
> > -	flow_dbg(conn, "timer expires in %llu.%03llus",
> > -		 (unsigned long long)it.it_value.tv_sec,
> > -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > +	if (conn->flags & ACK_TO_TAP_DUE) {
> > +		flow_trace(conn, "timer expires in %lu.%01llums",
> > +			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> > +			   (unsigned long long)it.it_value.tv_nsec / 1000);
>
> This doesn't look right - you need a % to exclude the whole
> milliseconds here for the fractional part.

Ah, oops, right, and on top of that this can be more than one second
but I forgot to add it. Fixed in v3.

> Plus, it looks like this
> is trying to compute microseconds, which would be 3 digits after the
> . in ms, but the format string accommodates only one.

That was intended, I wanted to show only the first digit of
microseconds given that the smallest values are hundreds of
microseconds, but changed anyway given the possible confusion.

-- 
Stefano
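
For readers following the arithmetic in this exchange, here is a minimal, self-contained sketch. It is not the v3 code, and it is not taken from tcp_conn.h, which isn't quoted here. It assumes the stored exponent is a power of two applied to the 100 us minimum, an inference from 100 us * 2^15 = 3.2768 s, and that the encoding rounds down. The names rtt_to_exp(), exp_to_rtt(), half_rtt_interval() and format_ms() are hypothetical. The last helper shows one way to print whole milliseconds, including the seconds part, followed by three fractional digits.

```c
#include <stdio.h>
#include <time.h>

#define RTT_STORE_MIN	100UL	/* us, lower bound named in the patch */
#define RTT_EXP_MAX	15U	/* assumption: 100 us << 15 == 3.2768 s */

/* Hypothetical encoder: largest exponent e such that
 * (RTT_STORE_MIN << e) <= rtt_us, clamped to [0, RTT_EXP_MAX].
 */
static unsigned rtt_to_exp(unsigned long rtt_us)
{
	unsigned e = 0;

	while (e < RTT_EXP_MAX && (RTT_STORE_MIN << (e + 1)) <= rtt_us)
		e++;

	return e;
}

/* Hypothetical decoder: approximated RTT in microseconds */
static unsigned long exp_to_rtt(unsigned e)
{
	return RTT_STORE_MIN << e;
}

/* Half of the RTT (given in us), split into seconds and nanoseconds */
static struct timespec half_rtt_interval(unsigned long rtt_us)
{
	unsigned long half_us = rtt_us / 2;
	struct timespec ts;

	ts.tv_sec = (time_t)(half_us / 1000000UL);
	ts.tv_nsec = (long)(half_us % 1000000UL) * 1000;

	return ts;
}

/* Print as milliseconds: seconds are folded into the integer part, and
 * the modulo keeps only the sub-millisecond microseconds (3 digits).
 */
static void format_ms(char *buf, size_t len, const struct timespec *ts)
{
	unsigned long long ms = (unsigned long long)ts->tv_sec * 1000 +
				(unsigned long long)ts->tv_nsec / 1000000;
	unsigned long us = (unsigned long)(ts->tv_nsec / 1000 % 1000);

	snprintf(buf, len, "%llu.%03lums", ms, us);
}
```

Under these assumptions, a kernel-reported RTT of 5000 us would be stored as exponent 5 and read back as 3200 us, so the check would be scheduled after 1.6 ms. The coarse steps are the cost of keeping the stored value within a few bits.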