From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=BUTZMUqt; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTPS id 702E75A026F for ; Wed, 08 Oct 2025 00:42:20 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1759876939; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=tuMUcz1J52et5J0zjS+T4KxLvaUfWrHljDjh+fxa+3A=; b=BUTZMUqtd2hGQmGQreLtUFXTNAqUEA34T8b6SPAoBE7C8DB8B7hOHFgg0hIyKRYmMVbINq P6yFpOyEIM1PtTILIwTbS8kQV+nhAaikyEH1G9wUC/7oV1tk2lzp7z1fH5FgLjM3TSmyvg ur39BJCtYDkUGazYfUoB9zZdXeP6ALs= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-488-2MViElouNrySNeiHX0TwKw-1; Tue, 07 Oct 2025 18:42:17 -0400 X-MC-Unique: 2MViElouNrySNeiHX0TwKw-1 X-Mimecast-MFC-AGG-ID: 2MViElouNrySNeiHX0TwKw_1759876937 Received: by mail-wr1-f69.google.com with SMTP id ffacd0b85a97d-3e997eb7232so2829621f8f.3 for ; Tue, 07 Oct 2025 15:42:17 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1759876936; x=1760481736; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=tuMUcz1J52et5J0zjS+T4KxLvaUfWrHljDjh+fxa+3A=; b=u9FL5LCUfn83BaNIen3zGigODLxzw9LL92iC3VigOojrdM1SqLZpXGyMvPVP+2S4Rr wvSNlzS1M3pgYdF0TojcQUPgIcFamkzSHnnPGWdIRYT40e47GxfyFQQhsrJHDodhmD+3 cuLcTvTYjUYsiMjtvRrKb229LkISTrHpe7Ih3zP+2MwvIhZLah6B4CnhWBJOS/CorUkb tWh5Babs6ltZQPW9I0ZykkCZrxh24+qEXMjC+Ibr8QI7ot/h+grDKFOx0JeQKi9NJ2Yg PVZ9sGMWDKUpokRVGgfzE68NKwM/NBbHHGNudYNJLv2ngXKYIA925+jFO+h++Jg7UKXt STDw== X-Gm-Message-State: AOJu0Yze+kf/X8SLrfIz9Y0sAYKViNhMJi9JlbKdkfTX9uqfO+FOCKPn 9PSPz42vHk/XqE7fk/aD5XwAEhKG2ecV6v695/hqlHNqwnw6QAmzfMYL/ViYz9xBU0B0FcOY1hB GOAQZn0XLfsDEU537nsmDl6XRGiWE1mbg/mWuYrlGFcQvj/8T5oinj/BngKr4Zg== X-Gm-Gg: ASbGncuQn3dihYwab5/Jj/TlgWEXYiE+b95YaVBTVGfHG2V1LDoV5u2RblV4/sgOy8Q dSNRSBoAOxEU1SmCE4QEWyaIClNMA9pQSOrc1394+C79VJmfe4lpnxF72wzNsdIXAbir5m3gle6 7wb7Dms4b7q6phX2XnvH570c1NQPCGBt1lzFuJNJh2n3n5KplEMdUfqJpsBs74CARk9N94j/q9Y 4gldle8CUipbaxt1ZYq7TRCuCOl6RfeBhcPgYJxTRWfi7hs1BwTMnQ+a+XL2utqGGS+bessXj1H ZXU0MK/WfbVIBJzhbTPbcGiRNT8ISqBGVdJkqPpBHWqpbSGvH+cPuvjj X-Received: by 2002:a05:6000:2c0e:b0:3df:c5e3:55fe with SMTP id ffacd0b85a97d-4266e7dff05mr565895f8f.29.1759876935999; Tue, 07 Oct 2025 15:42:15 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGxzjOoe8xNYoOBb87Bmu+gUfR4IWuidSguSylwfU/qgN0mQ6rsNhhMmc4rldhGCAe/x2OcLg== X-Received: by 2002:a05:6000:2c0e:b0:3df:c5e3:55fe with SMTP id ffacd0b85a97d-4266e7dff05mr565879f8f.29.1759876935039; Tue, 07 Oct 2025 15:42:15 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1]) by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-4255d8f45e9sm27571145f8f.51.2025.10.07.15.42.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 07 Oct 2025 15:42:14 -0700 (PDT) Date: Wed, 8 Oct 2025 00:42:12 +0200 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH 1/1] tcp: Clarify logic calculating how much guest data to ack Message-ID: <20251008004212.25d0d0dc@elisabeth> In-Reply-To: <20251003063051.1127873-2-david@gibson.dropbear.id.au> References: <20251003063051.1127873-1-david@gibson.dropbear.id.au> <20251003063051.1127873-2-david@gibson.dropbear.id.au> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: OgJsLaoxwMtM53goe0WJAUC2M4SaH940ogpB4VlgAl8_1759876937 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: EX4OYJLYB24LXSEY2YEQFRWJFS6FLYIO X-Message-ID-Hash: EX4OYJLYB24LXSEY2YEQFRWJFS6FLYIO X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, 3 Oct 2025 16:30:51 +1000 David Gibson wrote: > This is fairly complex, because we have a method we prefer but we need to > fall back to a simpler one in a bunch of cases. Slightly reorganise the > code to make the flow clearer, and add a large comment giving the > rationale. I think this is a strict improvement on the original and I was about to apply it regardless of my pending series with TCP fixes (it looks completely independent to me) and a few nits I had, but then I noticed one bit that might be substantially misleading, at the end. So here come all my comments: > Signed-off-by: David Gibson > --- > tcp.c | 68 ++++++++++++++++++++++++++++++++++++----------------------- > 1 file changed, 42 insertions(+), 26 deletions(-) > > diff --git a/tcp.c b/tcp.c > index 7da41797..85eb2c32 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1014,35 +1014,51 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, > uint32_t new_wnd_to_tap = prev_wnd_to_tap; > int s = conn->sock; > > - if (!bytes_acked_cap) { > - conn->seq_ack_to_tap = conn->seq_from_tap; > - if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap)) > - conn->seq_ack_to_tap = prev_ack_to_tap; > - } else { > - if ((unsigned)SNDBUF_GET(conn) < SNDBUF_SMALL || > - tcp_rtt_dst_low(conn) || CONN_IS_CLOSING(conn) || > - (conn->flags & LOCAL) || force_seq) { > - conn->seq_ack_to_tap = conn->seq_from_tap; > - } else if (conn->seq_ack_to_tap != conn->seq_from_tap) { > - if (!tinfo) { > - tinfo = &tinfo_new; > - if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) > - return 0; > - } > - > - /* This trips a cppcheck bug in some versions, including > - * cppcheck 2.18.3. > - * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/ > - */ > - /* cppcheck-suppress [uninitvar,unmatchedSuppression] */ > - conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked + > - conn->seq_init_from_tap; > - > - if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap)) > - conn->seq_ack_to_tap = prev_ack_to_tap; > + /* At this point we could ack all the data we've accepted for forwarding > + * (seq_from_tap). When possible, however, we want to only ack what the > + * peer has acked. This makes it appear to the guest more like a direct > + * connection to the peer, and may improve flow control behaviour. For consistency, as we don't use "ack" as a verb anywhere else, maybe spell it out as "acknowledge" / "acknowledged". > + * > + * For it to be possible and worth it we need: > + * - The TCP_INFO Linux extension which gives us the peer acked bytes > + * - Not to be told not to (force_seq) > + * - Not half-closed in the peer->guest direction > + * With no data coming from the peer, we won't get further events > + * which would prompt us to recheck bytes_acked. We could poll on > + * a timer, but that's more trouble than it's worth. Strictly speaking, we could (and usually do) get further events prompting us to check bytes_acked, in the form of segments from the guest, but perhaps we can just leave this detail out for brevity, unless you want to try and factor that in. > + * - Not a host local connection The tcp_rtt_dst_low() is a trick to consider "local" also anything (VMs) that's connected to us via veth. It's not local from a network segment perspective, but it's local to the machine, and the same consideration applies (somewhat surprisingly, for veth). Same here, I guess we could leave this out for brevity. > + * Data goes directly from socket to socket in this case, with > + * nothing meaningful "in flight". > + * - Large enough send buffer > + * If this is small, there's not enough in flight to bother. > + */ > + if (bytes_acked_cap && !force_seq && > + !CONN_IS_CLOSING(conn) && > + !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) && > + (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) { > + if (!tinfo) { > + tinfo = &tinfo_new; > + if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) > + return 0; > } > + > + /* This trips a cppcheck bug in some versions, including > + * cppcheck 2.18.3. > + * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/ > + */ > + /* cppcheck-suppress [uninitvar,unmatchedSuppression] */ > + conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked + > + conn->seq_init_from_tap; Maybe fix the indentation while at it? conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked + conn->seq_init_from_tap; > + } else { > + /* Fall back to acking everything we have */ Maybe specifically refer to what we got so far, /* Fall back to acknowledging everything we got */ ? > + conn->seq_ack_to_tap = conn->seq_from_tap; > } > > + /* If the guest is retransmitting, don't let our ACKed sequence go > + * backwards */ This is the misleading part I realised about, after I mentioned it in: https://archives.passt.top/passt-dev/20251007003219.3f286b1d@elisabeth/ ...the reason why we risk rewinding the acknowledged sequence isn't that the guest is retransmitting, because in that case we wouldn't have advanced conn->seq_to_tap to begin with. The reason is that one of those conditions for using bytes_acked you listed above happened to be false, and now it becomes true again. The only practical one I can think of is the array used by tcp_rtt_dst_low() getting full at some point, but later we re-insert the peer we're talking to in the table. By the way, for consistency: /* Multi-line * comment */ > + if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap)) > + conn->seq_ack_to_tap = prev_ack_to_tap; The reason behind the current code structure is to skip this if we didn't touch conn->seq_ack_to_tap at all, but the compiler will probably figure this out by itself, and even if it doesn't, I guess it's more efficient to do this unconditionally anyway. > + > if (!snd_wnd_cap) { > tcp_get_sndbuf(conn); > new_wnd_to_tap = MIN(SNDBUF_GET(conn), MAX_WINDOW); -- Stefano