On Wed, Oct 08, 2025 at 12:42:12AM +0200, Stefano Brivio wrote:
> On Fri, 3 Oct 2025 16:30:51 +1000
> David Gibson wrote:
>
> > This is fairly complex, because we have a method we prefer but we need to
> > fall back to a simpler one in a bunch of cases. Slightly reorganise the
> > code to make the flow clearer, and add a large comment giving the
> > rationale.
>
> I think this is a strict improvement on the original and I was about to
> apply it regardless of my pending series with TCP fixes (it looks
> completely independent to me) and a few nits I had, but then I noticed
> one bit that might be substantially misleading, at the end.
>
> So here come all my comments:
>
> > Signed-off-by: David Gibson
> > ---
> >  tcp.c | 68 ++++++++++++++++++++++++++++++++++++-----------------------
> >  1 file changed, 42 insertions(+), 26 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 7da41797..85eb2c32 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1014,35 +1014,51 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> >  	int s = conn->sock;
> >
> > -	if (!bytes_acked_cap) {
> > -		conn->seq_ack_to_tap = conn->seq_from_tap;
> > -		if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > -			conn->seq_ack_to_tap = prev_ack_to_tap;
> > -	} else {
> > -		if ((unsigned)SNDBUF_GET(conn) < SNDBUF_SMALL ||
> > -		    tcp_rtt_dst_low(conn) || CONN_IS_CLOSING(conn) ||
> > -		    (conn->flags & LOCAL) || force_seq) {
> > -			conn->seq_ack_to_tap = conn->seq_from_tap;
> > -		} else if (conn->seq_ack_to_tap != conn->seq_from_tap) {
> > -			if (!tinfo) {
> > -				tinfo = &tinfo_new;
> > -				if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > -					return 0;
> > -			}
> > -
> > -			/* This trips a cppcheck bug in some versions, including
> > -			 * cppcheck 2.18.3.
> > -			 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> > -			 */
> > -			/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> > -			conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> > -				               conn->seq_init_from_tap;
> > -
> > -			if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > -				conn->seq_ack_to_tap = prev_ack_to_tap;
> > +	/* At this point we could ack all the data we've accepted for forwarding
> > +	 * (seq_from_tap). When possible, however, we want to only ack what the
> > +	 * peer has acked. This makes it appear to the guest more like a direct
> > +	 * connection to the peer, and may improve flow control behaviour.
>
> For consistency, as we don't use "ack" as a verb anywhere else, maybe
> spell it out as "acknowledge" / "acknowledged".

We don't in comments, but there is TAP_FIN_ACKED and bytes_acked_cap
(copied from the kernel's tcpi_bytes_acked). Adjusted, nonetheless.

> > +	 *
> > +	 * For it to be possible and worth it we need:
> > +	 * - The TCP_INFO Linux extension which gives us the peer acked bytes
> > +	 * - Not to be told not to (force_seq)
> > +	 * - Not half-closed in the peer->guest direction
> > +	 *     With no data coming from the peer, we won't get further events
> > +	 *     which would prompt us to recheck bytes_acked. We could poll on
> > +	 *     a timer, but that's more trouble than it's worth.
>
> Strictly speaking, we could (and usually do) get further events
> prompting us to check bytes_acked, in the form of segments from the
> guest, but perhaps we can just leave this detail out for brevity,
> unless you want to try and factor that in.

Good point. I was thinking about the fact that we don't get events
for the fact that bytes_acked has changed of itself, but the comment
doesn't make that clear. I've tweaked the wording.

> > +	 * - Not a host local connection
>
> The tcp_rtt_dst_low() is a trick to consider "local" also anything (VMs)
> that's connected to us via veth.
>
> It's not local from a network segment perspective, but it's local to
> the machine, and the same consideration applies (somewhat surprisingly,
> for veth). Same here, I guess we could leave this out for brevity.

I've adjusted the wording in a way I think is better, but it will want
a re-review.

>
> > +	 *     Data goes directly from socket to socket in this case, with
> > +	 *     nothing meaningful "in flight".
> > +	 * - Large enough send buffer
> > +	 *     If this is small, there's not enough in flight to bother.
> > +	 */
> > +	if (bytes_acked_cap && !force_seq &&
> > +	    !CONN_IS_CLOSING(conn) &&
> > +	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> > +	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> > +		if (!tinfo) {
> > +			tinfo = &tinfo_new;
> > +			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > +				return 0;
> >  		}
> > +
> > +		/* This trips a cppcheck bug in some versions, including
> > +		 * cppcheck 2.18.3.
> > +		 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> > +		 */
> > +		/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> > +		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> > +			               conn->seq_init_from_tap;
>
> Maybe fix the indentation while at it?
>
> 		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> 				       conn->seq_init_from_tap;

Ah, sure. That detail of our style isn't known by my editor, so I
missed it.

> > +	} else {
> > +		/* Fall back to acking everything we have */
>
> Maybe specifically refer to what we got so far,
>
> 		/* Fall back to acknowledging everything we got */
>
> ?

Uh, sure, done.
> > +		conn->seq_ack_to_tap = conn->seq_from_tap;
> >  	}
> >
> > +	/* If the guest is retransmitting, don't let our ACKed sequence go
> > +	 * backwards */
>
> This is the misleading part I realised about, after I mentioned it in:
>
>   https://archives.passt.top/passt-dev/20251007003219.3f286b1d@elisabeth/
>
> ...the reason why we risk rewinding the acknowledged sequence isn't
> that the guest is retransmitting, because in that case we wouldn't have
> advanced conn->seq_to_tap to begin with.

Right, I misunderstood the situation in which this logic is needed.

> The reason is that one of those conditions for using bytes_acked you
> listed above happened to be false, and now it becomes true again.

Aha!

> The only practical one I can think of is the array used by
> tcp_rtt_dst_low() getting full at some point, but later we re-insert the
> peer we're talking to in the table.

Comment updated.

> By the way, for consistency:
>
> 	/* Multi-line
> 	 * comment
> 	 */

Done.

> > +	if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > +		conn->seq_ack_to_tap = prev_ack_to_tap;
>
> The reason behind the current code structure is to skip this if we
> didn't touch conn->seq_ack_to_tap at all, but the compiler will probably
> figure this out by itself, and even if it doesn't, I guess it's more
> efficient to do this unconditionally anyway.

That was my thinking. It should all be nearly-free in-register integer
arithmetic, plus a single probably-L1-hot store. This way makes for
fewer conditionals to parse, primarily for the human reader though also
for the CPU.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
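
For anyone following along, the rewind case discussed above (falling
back to acknowledging everything, then switching to bytes_acked on a
later call) can be sketched standalone as below. This is only an
illustration under assumptions, not the code from the patch: seq_lt()
and next_ack() are hypothetical names, seq_lt() is not necessarily how
passt defines SEQ_LT(), and use_bytes_acked stands in for the whole
list of conditions in the comment.

```c
#include <stdint.h>

/* Hypothetical helper (an assumption, not passt's SEQ_LT()): true if
 * sequence number a comes before b in 32-bit modular arithmetic, valid
 * while the two are less than 2^31 apart.
 */
static int seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical stand-in for the selection logic. use_bytes_acked
 * represents all the "possible and worth it" conditions combined.
 * Returns the sequence to acknowledge to the guest.
 */
static uint32_t next_ack(uint32_t prev_ack, uint32_t seq_from_tap,
			 uint32_t seq_init_from_tap, uint64_t bytes_acked,
			 int use_bytes_acked)
{
	uint32_t ack;

	if (use_bytes_acked)
		ack = (uint32_t)bytes_acked + seq_init_from_tap;
	else
		ack = seq_from_tap;	/* acknowledge everything we got */

	/* Applied unconditionally: never rewind what was already
	 * acknowledged, for instance if an earlier call used the
	 * fallback and this one uses bytes_acked.
	 */
	if (seq_lt(ack, prev_ack))
		ack = prev_ack;

	return ack;
}
```

With an initial sequence of 1000 and 500 bytes accepted, a fallback
call acknowledges 1500. If the next call uses bytes_acked while the
peer has only acknowledged 200 bytes, the raw value would be 1200, and
the final check keeps it at 1500.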