public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: Yumei Huang <yuhuang@redhat.com>, passt-dev@passt.top
Subject: Re: [PATCH v3 4/4] tcp: Update data retransmission timeout
Date: Mon, 20 Oct 2025 07:11:07 +0200
Message-ID: <20251020071107.42fd40e9@elisabeth>
In-Reply-To: <aPWAQ4F-DyWyjVg9@zatzit>

On Mon, 20 Oct 2025 11:20:19 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Oct 17, 2025 at 08:28:12PM +0200, Stefano Brivio wrote:
> > On Thu, 16 Oct 2025 09:54:25 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Oct 15, 2025 at 02:31:27PM +0800, Yumei Huang wrote:  
> > > > On Wed, Oct 15, 2025 at 8:05 AM David Gibson
> > > > <david@gibson.dropbear.id.au> wrote:    
> > > > >
> > > > > On Tue, Oct 14, 2025 at 03:38:36PM +0800, Yumei Huang wrote:    
> > > > > > According to RFC 2988 and RFC 6298, we should use an exponential
> > > > > > backoff timeout for data retransmission starting from one second
> > > > > > (see Appendix A in RFC 6298), and limit it to about 60 seconds
> > > > > > as allowed by the same RFC:
> > > > > >
> > > > > >    (2.5) A maximum value MAY be placed on RTO provided it is at
> > > > > >          least 60 seconds.    
> > > > >
> > > > > The interpretation of this isn't entirely clear to me.  Does it mean
> > > > > if the total retransmit delay exceeds 60s we give up and RST (what
> > > > > this patch implements)?  Or does it mean that if the retransmit delay
> > > > > reaches 60s we keep retransmitting, but don't increase the delay any
> > > > > further?
> > > > >
> > > > > Looking at tcp_bound_rto() and related code in the kernel suggests the
> > > > > second interpretation.
> > > > >    
> > > > > > Combine the macros defining the initial timeout for both SYN and ACK.
> > > > > > And add a macro ACK_RETRIES to limit the total timeout to about 60s.
> > > > > >
> > > > > > Signed-off-by: Yumei Huang <yuhuang@redhat.com>
> > > > > > ---
> > > > > >  tcp.c | 32 ++++++++++++++++----------------
> > > > > >  1 file changed, 16 insertions(+), 16 deletions(-)
> > > > > >
> > > > > > diff --git a/tcp.c b/tcp.c
> > > > > > index 3ce3991..84da069 100644
> > > > > > --- a/tcp.c
> > > > > > +++ b/tcp.c
> > > > > > @@ -179,16 +179,12 @@
> > > > > >   *
> > > > > >   * Timeouts are implemented by means of timerfd timers, set based on flags:
> > > > > >   *
> > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
> > > > > > - *   (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend
> > > > > > - *   SYN. It's the starting timeout for the first SYN retry. If this persists
> > > > > > - *   for more than TCP_MAX_RETRIES or (tcp_syn_retries +
> > > > > > - *   tcp_syn_linear_timeouts) times in a row, reset the connection
> > > > > > - *
> > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending
> > > > > > - *   data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the
> > > > > > - *   socket and reset sequence to what was acknowledged. If this persists for
> > > > > > - *   more than TCP_MAX_RETRIES times in a row, reset the connection
> > > > > > > + * - ACK_TIMEOUT_INIT: if no ACK segment was received from tap/guest, either
> > > > > > > + *   during handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after
> > > > > > + *   sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data
> > > > > > + *   from the socket and reset sequence to what was acknowledged. It's the
> > > > > > + *   starting timeout for the first retry. If this persists for more than
> > > > > > + *   allowed times in a row, reset the connection
> > > > > >   *
> > > > > >   * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE
> > > > > >   *   with TAP_FIN_SENT event), and no ACK is received within this time, reset
> > > > > > @@ -342,8 +338,7 @@ enum {
> > > > > >  #define WINDOW_DEFAULT                       14600           /* RFC 6928 */
> > > > > >
> > > > > >  #define ACK_INTERVAL                 10              /* ms */
> > > > > > -#define SYN_TIMEOUT_INIT             1               /* s */
> > > > > > -#define ACK_TIMEOUT                  2
> > > > > > +#define ACK_TIMEOUT_INIT             1               /* s, RFC 6298 */    
> > > > >
> > > > > I'd suggest calling this RTO_INIT to match the terminology used in the
> > > > > RFCs.    
> > > > 
> > > > Sure.    
> > > > >    
> > > > > >  #define FIN_TIMEOUT                  60
> > > > > >  #define ACT_TIMEOUT                  7200
> > > > > >
> > > > > > @@ -352,6 +347,11 @@ enum {
> > > > > >
> > > > > >  #define ACK_IF_NEEDED        0               /* See tcp_send_flag() */
> > > > > >
> > > > > > +/* Number of retries calculated from the exponential backoff formula, limited
> > > > > > + * by a total timeout of about 60 seconds.
> > > > > > + */
> > > > > > +#define ACK_RETRIES          5
> > > > > > +    
> > > > >
> > > > > As noted above, I think this is based on a misunderstanding of what
> > > > > the RFC is saying.  TCP_MAX_RETRIES should be fine as it is, I think.
> > > > > We could implement the clamping of the RTO, but it's a "MAY" in the
> > > > > RFC, so we don't have to, and I don't really see a strong reason to do
> > > > > so.    
> > > > 
> > > > If we use TCP_MAX_RETRIES and not clamping RTO, the total timeout
> > > > could be 255 seconds.
> > > > 
> > > > Stefano mentioned "Retransmitting data after 256 seconds doesn't make
> > > > a lot of sense to me" in the previous comment.    
> > > 
> > > That's true, but it's pretty much true for 60s as well.  For the local
> > > link we usually have between passt and guest, even 1s is an eternity.  
> > 
> > Rather than the local link I was thinking of whatever monitor or
> > liveness probe in KubeVirt which might have a 60-second period, or some
> > firewall agent, or how long it typically takes for guests to stop and
> > resume again in KubeVirt.  
> 
> Right, I hadn't considered those.  Although.. do those actually re-use
> a single connection?  I would have guessed they use a new connection
> each time, making the timeouts here irrelevant.

It depends on the definition of "each time", because we don't time out
host-side connections immediately.

Pretending passt isn't there, the timeout would come from the default
values for TCP connections. It looks like there's no specific
SO_SNDTIMEO value set for those probes, and you can't configure the
timeout, at least according to:

  https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-tcp-liveness-probe

and for tcp_syn_retries, tcp(7) says:

  The default value is 6, which corresponds to retrying for up to
  approximately 127 seconds.
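
To make the arithmetic explicit, here's a rough sketch. backoff_total()
is a made-up helper, not something from passt or the kernel, and it
assumes that the first timeout is rto_init seconds, that every following
timeout doubles, that nothing is clamped, and that we give up once the
timeout following the last retry expires:

```c
#include <assert.h>

/* Hypothetical helper: total seconds spent waiting before giving up,
 * with plain exponential backoff and no clamping of the timeout.
 */
unsigned backoff_total(unsigned rto_init, unsigned retries)
{
	unsigned total = 0, rto = rto_init, i;

	assert(retries < 16);	/* keep the doubling far from overflow */

	/* One timeout before each retry, plus one after the last retry */
	for (i = 0; i <= retries; i++) {
		total += rto;
		rto *= 2;
	}

	return total;
}
```

With a one-second initial timeout, six retries give 127 seconds, which
matches tcp(7), five retries give 63 seconds, and seven retries give the
255 seconds mentioned earlier in this thread.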

In this series, to make things transparent, we read out those values,
so that part is fine. But does the Linux kernel clamp the RTO?

It turns out that yes, it does: TCP_RTO_MAX_SEC is 120 seconds (before
1280c26228bd ("tcp: add tcp_rto_max_ms sysctl") that was TCP_RTO_MAX,
same value), and it's used by tcp_retransmit_timer() via tcp_rto_max().
That commit also makes the maximum configurable.
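
As far as I can tell, the effect is that the timeout is doubled on each
expiry and capped at that maximum, and retransmissions continue at the
capped interval. A minimal sketch of that step, with a hypothetical
rto_next() helper working in milliseconds (this is not the kernel code,
just an illustration of the clamping):

```c
#include <assert.h>

/* Hypothetical helper: next retransmission timeout in milliseconds,
 * doubled from the current one and capped at rto_max_ms.
 */
unsigned long rto_next(unsigned long rto_ms, unsigned long rto_max_ms)
{
	assert(rto_max_ms > 0);

	/* Compare before doubling, so that rto_ms * 2 can't overflow */
	if (rto_ms > rto_max_ms / 2)
		return rto_max_ms;

	return rto_ms * 2;
}
```

Starting from 1000 ms with a 120000 ms maximum, this gives 2000, 4000,
and so on up to 64000, then 120000 for every timeout after that.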

I'm tempted to suggest that we should read out that value as well
(with a 120-second fallback for older kernels) to make our behaviour
as transparent as possible.
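
Something along these lines, as a sketch only: rto_max_ms_from() is a
hypothetical standalone helper (in the series this would rather go
through the new read_file_long()), and I'm assuming the sysctl shows up
as /proc/sys/net/ipv4/tcp_rto_max_ms, in milliseconds:

```c
#include <stdio.h>

#define RTO_MAX_PATH		"/proc/sys/net/ipv4/tcp_rto_max_ms"
#define RTO_MAX_FALLBACK	120000L		/* ms, for older kernels */

/* Hypothetical helper: read the maximum RTO in milliseconds from path,
 * falling back to 120 seconds if the file is missing or can't be parsed.
 */
long rto_max_ms_from(const char *path)
{
	long ms = RTO_MAX_FALLBACK;
	FILE *f = fopen(path, "r");

	if (!f)
		return RTO_MAX_FALLBACK;

	if (fscanf(f, "%ld", &ms) != 1 || ms <= 0)
		ms = RTO_MAX_FALLBACK;

	fclose(f);

	return ms;
}
```

A caller would use rto_max_ms_from(RTO_MAX_PATH) once at start-up and
clamp the retransmission timeout to the result.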

It's slightly more complicated and perhaps not strictly needed, but
we've been bitten a few times by cases where applications and users
expect us to behave like the Linux kernel, and we didn't... so maybe
we could do this as well while at it? Given the rest of this series,
it looks like a relatively small addition to it.

> > It's usually seconds or maybe minutes but not five minutes.
> >   
> > > Basically I see no harm, but also no advantage to clamping or limiting
> > > the RTO, so I'm suggesting going with the simplest code.  
> > 
> > The advantage I see is that we'll recover significantly faster in case
> > something went wrong.  
> 
> That's a fair point in a more general case.
> 
> > > Note that there are (rare) situations where we could get a response
> > > after minutes.
> > >  - The interface on the guest was disabled for a while
> > >  - An error in guest firewall configuration blocked packets for a while
> > >  - A bug on the guest caused the kernel to wedge for a while
> > >  - The user manually suspended the guest for a while (VM/passt only)
> > > 
> > > These generally indicate something has gone fairly badly wrong, but a
> > > long RTO gives the user a bit more time to realise their mistake and
> > > fix things.  
> > 
> > True, it's just that to me five minutes sounds like "broken beyond
> > repair", while one minute sounds like "oh we tried again and it worked".  
> 
> Eh, maybe.  By nature it's always going to be a bit arbitrary.
> 
> > > These are niche cases, but given the cost of implementing
> > > it is "do nothing"...  
> > 
> > ...anyway, it's not a strong preference from my side. It's mostly about
> > experience but I won't be able to really come up with obvious evidence
> > (at least not quickly), so if the code is significantly simpler...
> > whatever. It's not provable so I won't insist.  
> 
> It's a bit simpler, I'm not sure I'd go so far as "significantly".
> 
> > Note: the comments I'm replying to are from yesterday / Thursday, on
> > v3, and today / Friday we're at v6. I don't expect a week grace period
> > as you would on the kernel:
> > 
> >   https://docs.kernel.org/process/submitting-patches.html#don-t-get-discouraged-or-impatient
> > 
> > because we can surely move faster than that, but three versions in a
> > day, obviously before I get any chance to have a look, means a
> > substantial overhead for me, and I might miss the meaning and context of
> > comments of other reviewers (David in this case). There are no
> > changelogs in cover letters either.
> > 
> > I plan to skip to v6 but don't expect a review soon, because of that
> > overhead I just mentioned.  

-- 
Stefano

