From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=UHoKLD59; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id CD7F75A0625 for ; Tue, 09 Dec 2025 23:49:56 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1765320595; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=YiDLfEhuM7n7O25YZBXwv4lA/QsmKLNofZ6Fyvjam5k=; b=UHoKLD593BApHX5Mo7tCZIERpq1AL3w12/Jp/myP/iYA3SsFXZ/robfOV5O6pjPRhk5idq QOJct7n0q2nOv7x2MyhuG59pXe40XmzqGYZ2BNjBrvrt5Ls3z7IsaYJZKJ91QUDYjrwYOO Vtk6ii1AghrRyEhxlgkH6jGD7CEyOQM= Received: from mail-wm1-f69.google.com (mail-wm1-f69.google.com [209.85.128.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-595-l9GM6iIiOXO1fRJz8-t9bQ-1; Tue, 09 Dec 2025 17:49:54 -0500 X-MC-Unique: l9GM6iIiOXO1fRJz8-t9bQ-1 X-Mimecast-MFC-AGG-ID: l9GM6iIiOXO1fRJz8-t9bQ_1765320593 Received: by mail-wm1-f69.google.com with SMTP id 5b1f17b1804b1-4776079ada3so58153885e9.1 for ; Tue, 09 Dec 2025 14:49:54 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1765320593; x=1765925393; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-gg :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=YiDLfEhuM7n7O25YZBXwv4lA/QsmKLNofZ6Fyvjam5k=; b=uSWjNu62oZNxYALxNyli1GS89D9EDbzgy6MQqn9jiQulBebn6MpC9T24XD00tsrU1T TGmcavOJdAFgRs8xHLA+k+7HMvkBS7QsPztg9rKnaqXhViusrpL/h1SSyV8W1kxmEgMX BZjoN5LNVAtHIWrQDz/JMgy2h8fzKTOADMBirU+jrkHGprlneHwOCBCKywJ4YXwQyc+I EcMWBOfOGCZwqTqy0RFCbV/koPzyNSsmsYYZK4QU1PmjVbslLWoDJHVFhUMyxN1h+rti DOw4XR6Cmiy5EmaFPhVPCGu461Z4WOrxsak6Z2a755MxwWZQWIc2or6/kBv3E76iZN3f VSfA== X-Gm-Message-State: AOJu0YxYgSa8BTxCAZevHQOrY1LBieTWdTC+T/9J3Shyt4t2qZ4/8Zin XRZmS2a6C3SmS2CHSv4By9jVwEDFZ8TaQ2cXDojRayryziQaywo9n6IlO/EHzBMqHW4EDOm3wnE 6lny+29clweSApGIv3doWV9kg1NixZvV+O2Y4KamWYXgNkVFL3wz6eQ== X-Gm-Gg: ASbGncvFmo4cPto/9PBbAMEHVYnZ0xeWT9BTP8lg5bD5mPaB0LSl1ejcO0NF8RNGi+Z J86c0nQwxubHEUxeeBMjJo5qf6/qgLIHyawPcBespmwcic35WDV8s9qlFghCs2vXxcSnPEHFZpG B+0bTrUl4I809fDOYdyRD9fqdYCcOHqxx4WXiz7BQ/vxoXlzamnzgb67KCvUlr78xaGLDtiKxTx 9ZL5YwuRZu5rIOjVkOc2i3dNNQyt3IVtNYbs55pJAqCRnvxH6WVIZq8ZegitEnfelEfsbT0SEN/ JUiSF7qPmpJ8iWCOuvb1lSAzXXmQxbfDIhS2TjOEKYYsaExfIXzX7NW8LvhsEMRw9IRYI3UmUPh 3GlGC4jQlJNdftZsZmxq559KD/6CEcHY5ciBc/A== X-Received: by 2002:a05:600c:1392:b0:477:6374:6347 with SMTP id 5b1f17b1804b1-47a8378cdafmr3730655e9.22.1765320592876; Tue, 09 Dec 2025 14:49:52 -0800 (PST) X-Google-Smtp-Source: AGHT+IEGVfO7jUwXQiew82sX2pvUIraWUKltIn9jmQyQvqm1So21H69yihBkTaK72Od0tE1Dk8AICw== X-Received: by 2002:a05:600c:1392:b0:477:6374:6347 with SMTP id 5b1f17b1804b1-47a8378cdafmr3730465e9.22.1765320592172; Tue, 09 Dec 2025 14:49:52 -0800 (PST) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [176.103.220.4]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-47a7d9b5facsm28001345e9.7.2025.12.09.14.49.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 09 Dec 2025 14:49:51 -0800 (PST) Date: Tue, 9 Dec 2025 23:49:50 +0100 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH v3 04/10] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Message-ID: <20251209234950.53067946@elisabeth> In-Reply-To: References: <20251208072024.3884137-1-sbrivio@redhat.com> <20251208072024.3884137-5-sbrivio@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: -AYHmC7jcpP1QDmyXBVY6ZPrLlpRL1NjfwgUKxaSSGc_1765320593 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: CJBZDACCNDGMHJJK33W6B5DUZQLMRWFD X-Message-ID-Hash: CJBZDACCNDGMHJJK33W6B5DUZQLMRWFD X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Max Chernoff X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Tue, 9 Dec 2025 16:10:36 +1100 David Gibson wrote: > On Mon, Dec 08, 2025 at 08:20:17AM +0100, Stefano Brivio wrote: > > A fixed 10 ms ACK_INTERVAL timer value served us relatively well until > > the previous change, because we would generally cause retransmissions > > for non-local outbound transfers with relatively high (> 100 Mbps) > > bandwidth and non-local but low (< 5 ms) RTT. > > > > Now that retransmissions are less frequent, we don't have a proper > > trigger to check for acknowledged bytes on the socket, and will > > generally block the sender for a significant amount of time while > > we could acknowledge more data, instead. > > > > Store the RTT reported by the kernel using an approximation (exponent), > > to keep flow storage size within two (typical) cachelines. Check for > > socket updates when half of this time elapses: it should be a good > > indication of the one-way delay we're interested in (peer to us). > > > > Representable values are between 100 us and 3.2768 s, and any value > > outside this range is clamped to these bounds. This choice appears > > to be a good trade-off between additional overhead and throughput. > > > > This mechanism partially overlaps with the "low RTT" destinations, > > which we use to infer that a socket is connected to an endpoint to > > the same machine (while possibly in a different namespace) if the > > RTT is reported as 10 us or less. > > > > This change doesn't, however, conflict with it: we are reading > > TCP_INFO parameters for local connections anyway, so we can always > > store the RTT approximation opportunistically. > > > > Then, if the RTT is "low", we don't really need a timer to > > acknowledge data as we'll always acknowledge everything to the > > sender right away. However, we have limited space in the array where > > we store addresses of local destination, so the low RTT property of a > > connection might toggle frequently. Because of this, it's actually > > helpful to always have the RTT approximation stored. > > > > This could probably benefit from a future rework, though, introducing > > a more integrated approach between these two mechanisms. > > > > Signed-off-by: Stefano Brivio > > --- > > tcp.c | 30 +++++++++++++++++++++++------- > > tcp_conn.h | 9 +++++++++ > > util.c | 14 ++++++++++++++ > > util.h | 1 + > > 4 files changed, 47 insertions(+), 7 deletions(-) > > > > diff --git a/tcp.c b/tcp.c > > index 28d3304..0167121 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -202,9 +202,13 @@ > > * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on > > * either side, the connection is reset > > * > > - * - ACK_INTERVAL elapsed after data segment received from tap without having > > + * - RTT / 2 elapsed after data segment received from tap without having > > * sent an ACK segment, or zero-sized window advertised to tap/guest (flag > > - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent > > + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent. > > + * > > + * RTT, here, is an approximation of the RTT value reported by the kernel via > > + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to > > + * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly. > > * > > * > > * Summary of data flows (with ESTABLISHED event) > > @@ -341,7 +345,6 @@ enum { > > #define MSS_DEFAULT 536 > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > > -#define ACK_INTERVAL 10 /* ms */ > > #define RTO_INIT 1 /* s, RFC 6298 */ > > #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */ > > #define FIN_TIMEOUT 60 > > @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > } > > > > if (conn->flags & ACK_TO_TAP_DUE) { > > - it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; > > + it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000); > > + it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000; > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > int exp = conn->retries, timeout = RTO_INIT; > > if (!(conn->events & ESTABLISHED)) > > @@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > it.it_value.tv_sec = ACT_TIMEOUT; > > } > > > > - flow_dbg(conn, "timer expires in %llu.%03llus", > > - (unsigned long long)it.it_value.tv_sec, > > - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000); > > + if (conn->flags & ACK_TO_TAP_DUE) { > > + flow_trace(conn, "timer expires in %llu.%03llums", > > + (unsigned long)it.it_value.tv_sec * 1000 + > > + (unsigned long long)it.it_value.tv_nsec % > > + (1000 * 1000), > > + (unsigned long long)it.it_value.tv_nsec / 1000); > > This is the wrong way around. The ms part needs to be: Ouch, I... half swapped them. Almost. This risks haunting us for a bit. Luckily we have discrete values there so I could make this small table for --trace logs from the broken implementation: #include int main() { unsigned long rtt, n, s; printf("before fix\t\tafter fix\tRTT\n"); for (rtt = 100; rtt < 100 << 16; rtt <<= 1) { s = rtt / 2 / (1000 * 1000); n = rtt / 2 % (1000 * 1000) * 1000; printf("%llu.%03llu\t\t%lu.%02lu\t\t%lu\n", s * 1000 + n % (1000 * 1000), n / 1000, s * 1000 + n / (1000 * 1000), (n / 10000) % 100, rtt); } } before fix after fix RTT 50000.050 0.05 100 100000.100 0.10 200 200000.200 0.20 400 400000.400 0.40 800 800000.800 0.80 1600 600000.1600 1.60 3200 200000.3200 3.20 6400 400000.6400 6.40 12800 800000.12800 12.80 25600 600000.25600 25.60 51200 200000.51200 51.20 102400 400000.102400 102.40 204800 800000.204800 204.80 409600 600000.409600 409.60 819200 200000.819200 819.20 1638400 401000.638400 1638.40 3276800 > tv_sec * 1000 + tv_nsec / 1000000 > and the fractional (us) part: > (tv_nsec / 1000) % 1000 > > (or if you did just want a single digit after the ., then: > tv_nsec / 100000 % 10 > ) I guess I'll display two digits instead, I never had a ~100 ns RTT in my tests so I didn't notice until now that the resulting timeout would be displayed as 0.0ms with one fractional digit. Thanks for the snippet, I'll post a patch soon (of course, feel free to do so meanwhile if you already have a local change...). -- Stefano