From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=G9wb8EFX; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id 073495A0619 for ; Mon, 20 Oct 2025 07:11:17 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1760937077; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=NXwCICJHRdm1xblF3onE7rPX53FFP7yF5AiN5Gd5CSs=; b=G9wb8EFXu+sjunfY/3q5EyVM1Vqc22cILTlLUsT0BJdt9RTA4KjFwAUrDEBgckAqz8FcqV aEtwuYphIMyAvEdYk4tbOnkF6p5gnp7z0/uqQWTneqlWwx+9YWAYGZXThaRK/H2TlkjOl7 3zEzJG1WQ88cyOCRCxdA5/8N36L7Ojw= Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-377-95qBDw8_P7y9TDNRmMofug-1; Mon, 20 Oct 2025 01:11:15 -0400 X-MC-Unique: 95qBDw8_P7y9TDNRmMofug-1 X-Mimecast-MFC-AGG-ID: 95qBDw8_P7y9TDNRmMofug_1760937074 Received: by mail-wm1-f72.google.com with SMTP id 5b1f17b1804b1-4711899ab0aso29364465e9.2 for ; Sun, 19 Oct 2025 22:11:14 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1760937074; x=1761541874; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=P5zyMrrG/ShVjL8bOaiscl8IbuUvzdp0vNH8zefWliA=; b=eEMI1t2FSaeYoxjHbXh+SRTQETX351vFbvpusaWmv8OMxI8VoC1Id2tBcD2sQbmBd3 GECixxfmBZfQDjALXGuIavhcRGgCJwK59q9UKTuYhn/jGNPNSpAK1NMsrasdmxT9sJXp D58OPHEqDqkgHX4ULclUwK0Ub6fWvZAIniYU5ZjalnEfWOb8/MRWzAwD05n1Kuuv6jwH w1r/RhYUMnp4wOL6iEXcW5l1OhslQbPEXGgzEHr8fNTzH9BdPwMNbCWotw1emD/zdcM7 xCNQuFjGhcH1liSbi5OG0rVsw89eHJRLRUfFtzKJ67CxIM9928f2/4+pZHEIiUqOlW5a TigQ== X-Forwarded-Encrypted: i=1; AJvYcCXrw4YvUv91tLsDYgQc0ueYdfhGQAp2763x+knyoG2DDe+sBCEC1Ezu2g2zCR2x8/W8pL8VsHvKgmw=@passt.top X-Gm-Message-State: AOJu0YwYY2ZCwXOwsSJcQ7pM6yUZ+TA8uaBWrWhArhjhqevK0Xr8oggc jyTS2m734AS4a/Ho8Cz8bjH+/1bHxROoLPZNUodGN3hRPPkGXa8zgenc65WG4eMIZEXWOhKecDL IDv7J1nXE2Yur1rxdRypz0uQKmIud3+2Wm7msxjJYVQ7J4Ijiz0lOpg== X-Gm-Gg: ASbGnctcO88zNORzFr0A16bA31gzhaTYb9zAsEQK6upL0xLAF13yiH+sjFCXSdiN0N4 ihyeP/AFRTZw4tY31Ttqz3smFWKlaLzyn7EavPNuSMSv7tg7CfyRHBfeenUR5NtYxECvlbb01Ys zkHNEaTGbzM0jJGnBSRBBYjH+vwFQPnOWbpmb7opkTcJj1feIM1witnGzUfc2ghIMfFTNlKy2o3 XlHxALyk9LzcpJKGWCjhP2q1+UzsNeFCMCJLjaimv6bMjdbCuJMVw+AsNIy0T9Q7qulLuxUbOKv EoqBpr7JMbsObjdwLsqWRTODJpBMEl5Xh1g4gJ1wIxT1UwhrYXYyHCksZi4n2fOjYHEJKm3ghcK P0tvO5L9tL5Vu0jacrBybk0ocRuk= X-Received: by 2002:a05:600c:1da7:b0:46e:4be1:a423 with SMTP id 5b1f17b1804b1-4711786d630mr73945815e9.1.1760937073637; Sun, 19 Oct 2025 22:11:13 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGD3khGSAry1BTmhy7iScDiSwXRDN56gNjKnKvaFG4oQ7/B2kxG59aeiUxvD8EVNjSUG+/x2g== X-Received: by 2002:a05:600c:1da7:b0:46e:4be1:a423 with SMTP id 5b1f17b1804b1-4711786d630mr73945585e9.1.1760937072943; Sun, 19 Oct 2025 22:11:12 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [176.103.220.4]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4710cd4480fsm108871955e9.1.2025.10.19.22.11.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 19 Oct 2025 22:11:12 -0700 (PDT) Date: Mon, 20 Oct 2025 07:11:07 +0200 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH v3 4/4] tcp: Update data retransmission timeout Message-ID: <20251020071107.42fd40e9@elisabeth> In-Reply-To: References: <20251014073836.18150-1-yuhuang@redhat.com> <20251014073836.18150-5-yuhuang@redhat.com> <20251017202812.173e9352@elisabeth> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: tSlsWL69bIXvaXpMjTA2Qt7bYPUVdGPfHvJnTapFB-0_1760937074 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Message-ID-Hash: QZ3HX5UKPHNI73BJFMXK5YWFDZ7S4CBY X-Message-ID-Hash: QZ3HX5UKPHNI73BJFMXK5YWFDZ7S4CBY X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Yumei Huang , passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Mon, 20 Oct 2025 11:20:19 +1100 David Gibson wrote: > On Fri, Oct 17, 2025 at 08:28:12PM +0200, Stefano Brivio wrote: > > On Thu, 16 Oct 2025 09:54:25 +1100 > > David Gibson wrote: > > =20 > > > On Wed, Oct 15, 2025 at 02:31:27PM +0800, Yumei Huang wrote: =20 > > > > On Wed, Oct 15, 2025 at 8:05=E2=80=AFAM David Gibson > > > > wrote: =20 > > > > > > > > > > On Tue, Oct 14, 2025 at 03:38:36PM +0800, Yumei Huang wrote: = =20 > > > > > > According to RFC 2988 and RFC 6298, we should use an exponentia= l > > > > > > backoff timeout for data retransmission starting from one secon= d > > > > > > (see Appendix A in RFC 6298), and limit it to about 60 seconds > > > > > > as allowed by the same RFC: > > > > > > > > > > > > (2.5) A maximum value MAY be placed on RTO provided it is at > > > > > > least 60 seconds. =20 > > > > > > > > > > The interpretation of this isn't entirely clear to me. Does it m= ean > > > > > if the total retransmit delay exceeds 60s we give up and RST (wha= t > > > > > this patch implements)? Or does it mean that if the retransmit d= elay > > > > > reaches 60s we keep retransmitting, but don't increase the delay = any > > > > > further? > > > > > > > > > > Looking at tcp_bound_rto() and related code in the kernel suggest= s the > > > > > second interpretation. > > > > > =20 > > > > > > Combine the macros defining the initial timeout for both SYN an= d ACK. > > > > > > And add a macro ACK_RETRIES to limit the total timeout to about= 60s. > > > > > > > > > > > > Signed-off-by: Yumei Huang > > > > > > --- > > > > > > tcp.c | 32 ++++++++++++++++---------------- > > > > > > 1 file changed, 16 insertions(+), 16 deletions(-) > > > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > > index 3ce3991..84da069 100644 > > > > > > --- a/tcp.c > > > > > > +++ b/tcp.c > > > > > > @@ -179,16 +179,12 @@ > > > > > > * > > > > > > * Timeouts are implemented by means of timerfd timers, set ba= sed on flags: > > > > > > * > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest du= ring handshake > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within = this time, resend > > > > > > - * SYN. It's the starting timeout for the first SYN retry. I= f this persists > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connec= tion > > > > > > - * > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/gues= t, after sending > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-s= end data from the > > > > > > - * socket and reset sequence to what was acknowledged. If th= is persists for > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the conne= ction > > > > > > + * - ACK_TIMEOUT_INIT: if no ACK segment was received from tap= /guest, eiher > > > > > > + * during handshake(flag ACK_FROM_TAP_DUE without ESTABLISHE= D event) or after > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED even= t), re-send data > > > > > > + * from the socket and reset sequence to what was acknowledg= ed. It's the > > > > > > + * starting timeout for the first retry. If this persists fo= r more than > > > > > > + * allowed times in a row, reset the connection > > > > > > * > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag= ACK_FROM_TAP_DUE > > > > > > * with TAP_FIN_SENT event), and no ACK is received within t= his time, reset > > > > > > @@ -342,8 +338,7 @@ enum { > > > > > > #define WINDOW_DEFAULT 14600 /= * RFC 6928 */ > > > > > > > > > > > > #define ACK_INTERVAL 10 /* ms */ > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */ > > > > > > -#define ACK_TIMEOUT 2 > > > > > > +#define ACK_TIMEOUT_INIT 1 /* s, RFC= 6298 */ =20 > > > > > > > > > > I'd suggest calling this RTO_INIT to match the terminology used i= n the > > > > > RFCs. =20 > > > >=20 > > > > Sure. =20 > > > > > =20 > > > > > > #define FIN_TIMEOUT 60 > > > > > > #define ACT_TIMEOUT 7200 > > > > > > > > > > > > @@ -352,6 +347,11 @@ enum { > > > > > > > > > > > > #define ACK_IF_NEEDED 0 /* See tcp_send_f= lag() */ > > > > > > > > > > > > +/* Number of retries calculated from the exponential backoff f= ormula, limited > > > > > > + * by a total timeout of about 60 seconds. > > > > > > + */ > > > > > > +#define ACK_RETRIES 5 > > > > > > + =20 > > > > > > > > > > As noted above, I think this is based on a misunderstanding of wh= at > > > > > the RFC is saying. TCP_MAX_RETRIES should be fine as it is, I th= ink. > > > > > We could implement the clamping of the RTO, but it's a "MAY" in t= he > > > > > RFC, so we don't have to, and I don't really see a strong reason = to do > > > > > so. =20 > > > >=20 > > > > If we use TCP_MAX_RETRIES and not clamping RTO, the total timeout > > > > could be 255 seconds. > > > >=20 > > > > Stefano mentioned "Retransmitting data after 256 seconds doesn't ma= ke > > > > a lot of sense to me" in the previous comment. =20 > > >=20 > > > That's true, but it's pretty much true for 60s as well. For the loca= l > > > link we usually have between passt and guest, even 1s is an eternity.= =20 > >=20 > > Rather than the local link I was thinking of whatever monitor or > > liveness probe in KubeVirt which might have a 60-second period, or some > > firewall agent, or how long it typically takes for guests to stop and > > resume again in KubeVirt. =20 >=20 > Right, I hadn't considered those. Although.. do those actually re-use > a single connection? I would have guessed they use a new connection > each time, making the timeouts here irrelevant. It depends on the definition of "each time", because we don't time out host-side connections immediately. Pretending passt isn't there, the timeout would come from the default values for TCP connections. It looks like there's no specific SO_SNDTIMEO value set for those probes, and you can't configure the timeout, at least according to: https://kubernetes.io/docs/tasks/configure-pod-container/configure-livene= ss-readiness-startup-probes/#define-a-tcp-liveness-probe and for tcp_syn_retries, tcp(7) says: The default value is 6, which corresponds to retrying for up to approximately 127 seconds. In this series, to make things transparent, we read out those values, so that part is fine. But does the Linux kernel clamp the RTO? It turns out that yes, it does, TCP_RTO_MAX_SEC is 120 seconds (before 1280c26228bd ("tcp: add tcp_rto_max_ms sysctl") that was TCP_RTO_MAX, same value), and it's used by tcp_retransmit_timer() via tcp_rto_max(). That change makes it configurable. I'm tempted to suggest that we should read out that value as well (with a 120-second fallback for older kernels) to make our behaviour as transparent as possible. It's slightly more complicated and perhaps not strictly needed, but we've been bitten a few times by cases where applications and users expect us to behave like the Linux kernel, and we didn't... so maybe we could do this as well while at it? Given the rest of this series, it looks like a relatively small addition to it. > > It's usually seconds or maybe minutes but not five minutes. > > =20 > > > Basically I see no harm, but also no advantage to clamping or limitin= g > > > the RTO, so I'm suggesting going with the simplest code. =20 > >=20 > > The advantage I see is that we'll recover significantly faster in case > > something went wrong. =20 >=20 > That's a fair point in a more general case. >=20 > > > Note that there are (rare) situations where we could get a response > > > after minutes. > > > - The interface on the guest was disabled for a while > > > - An error in guest firewall configuration blocked packets for a whi= le > > > - A bug on the guest cause the kernel to wedge for a while > > > - The user manually suspended the guest for a while (VM/passt only) > > >=20 > > > These generally indicate something has gone fairly badly wrong, but a > > > long RTO gives the user a bit more time to realise their mistake and > > > fix things. =20 > >=20 > > True, it's just that to me five minutes sounds like "broken beyond > > repair", while one minute sounds like "oh we tried again and it worked"= . =20 >=20 > Eh, maybe. By nature it's always going to be a bit arbitrary. >=20 > > > These are niche cases, but given the cost of implementing > > > it is "do nothing"... =20 > >=20 > > ...anyway, it's not a strong preference from my side. It's mostly about > > experience but I won't be able to really come up with obvious evidence > > (at least not quickly), so if the code is significantly simpler... > > whatever. It's not provable so I won't insist. =20 >=20 > It's a bit simpler, I'm not sure I'd go so far as "significantly". >=20 > > Note: the comments I'm replying to are from yesterday / Thursday, on > > v3, and today / Friday we're at v6. I don't expect a week grace period > > as you would on the kernel: > >=20 > > https://docs.kernel.org/process/submitting-patches.html#don-t-get-dis= couraged-or-impatient > >=20 > > because we can surely move faster than that, but three versions in a > > day obviously before I get any chance to have a look means a > > substantial overhead for me, and I might miss the meaning and context o= f > > comments of other reviewers (David in this case). There are no > > changelogs in cover letters either. > >=20 > > I plan to skip to v6 but don't expect a review soon, because of that > > overhead I just mentioned. =20 --=20 Stefano