On Mon, Dec 08, 2025 at 08:25:29AM +0100, Stefano Brivio wrote:
> On Mon, 8 Dec 2025 16:54:55 +1100
> David Gibson wrote:
> 
> > On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
> > > ...instead of checking if the current sending buffer is less than
> > > SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
> > > ACK segments: we rely on having enough data at once from the sender
> > > to make the buffer grow by means of TCP buffer size tuning
> > > implemented in the Linux kernel.
> > > 
> > > This is important if we're trying to maximise throughput, but not
> > > desirable for interactive traffic, where we want to be as transparent
> > > as possible and avoid introducing unnecessary latency.
> > > 
> > > Use the tcpi_delivery_rate field reported by the Linux kernel, if
> > > available, to calculate the current bandwidth-delay product: if it's
> > > significantly smaller than the available sending buffer, conclude that
> > > we're not bandwidth-bound and this is likely to be interactive
> > > traffic, so acknowledge data only as it's acknowledged by the peer.
> > > 
> > > Conversely, if the bandwidth-delay product is comparable to the size
> > > of the sending buffer (more than 5%), we're probably bandwidth-bound
> > > or... bound to be: acknowledge everything in that case.
> > 
> > Ah, nice. This reasoning is much clearer to me than the previous
> > spin.
> > 
> > > Signed-off-by: Stefano Brivio
> > > ---
> > >  tcp.c | 45 +++++++++++++++++++++++++++++++++------------
> > >  1 file changed, 33 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index 9bf7b8b..533c8a7 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -353,6 +353,9 @@ enum {
> > >  #define LOW_RTT_TABLE_SIZE    8
> > >  #define LOW_RTT_THRESHOLD     10 /* us */
> > > 
> > > +/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> > > +#define SNDBUF_TO_BW_DELAY_INTERACTIVE    /* > */ 20 /* (i.e. < 5% of buffer) */
> > > +
> > >  #define ACK_IF_NEEDED    0    /* See tcp_send_flag() */
> > > 
> > >  #define CONN_IS_CLOSING(conn)    \
> > > @@ -426,11 +429,13 @@ socklen_t tcp_info_size;
> > >      sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
> > > 
> > >  /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
> > > -#define snd_wnd_cap        tcp_info_cap(snd_wnd)
> > > +#define snd_wnd_cap            tcp_info_cap(snd_wnd)
> > >  /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
> > > -#define bytes_acked_cap    tcp_info_cap(bytes_acked)
> > > +#define bytes_acked_cap        tcp_info_cap(bytes_acked)
> > >  /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
> > > -#define min_rtt_cap        tcp_info_cap(min_rtt)
> > > +#define min_rtt_cap            tcp_info_cap(min_rtt)
> > > +/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
> > > +#define delivery_rate_cap      tcp_info_cap(delivery_rate)
> > > 
> > >  /* sendmsg() to socket */
> > >  static struct iovec    tcp_iov    [UIO_MAXIOV];
> > > @@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > >      socklen_t sl = sizeof(*tinfo);
> > >      struct tcp_info_linux tinfo_new;
> > >      uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> > > +    bool ack_everything = true;
> > >      int s = conn->sock;
> > > 
> > >      /* At this point we could ack all the data we've accepted for forwarding
> > > @@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > >       * control behaviour.
> > >       *
> > >       * For it to be possible and worth it we need:
> > > -     * - The TCP_INFO Linux extension which gives us the peer acked bytes
> > > +     * - The TCP_INFO Linux extensions which give us the peer acked bytes
> > > +     *   and the delivery rate (outbound bandwidth at receiver)
> > >       * - Not to be told not to (force_seq)
> > >       * - Not half-closed in the peer->guest direction
> > >       *   With no data coming from the peer, we might not get events which
> > > @@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > >       *   Data goes from socket to socket, with nothing meaningfully "in
> > >       *   flight".
> > >       * - Not a pseudo-local connection (e.g. to a VM on the same host)
> > > -     * - Large enough send buffer
> > > -     *   In these cases, there's not enough in flight to bother.
> > > +     *   If it is, there's not enough in flight to bother.
> > > +     * - Sending buffer significantly larger than bandwidth * delay product
> > > +     *   Meaning we're not bandwidth-bound and this is likely to be
> > > +     *   interactive traffic where we want to preserve transparent
> > > +     *   connection behaviour and latency.
> > 
> > Do we actually want the sending buffer size here? Or the amount of
> > buffer that's actually in use (SIOCOUTQ)? If we had a burst transfer
> > followed by interactive traffic, the kernel could still have a large
> > send buffer allocated, no?
> 
> The kernel shrinks it rather fast, and if it's not fast enough, then it
> still looks like bulk traffic. I tried several metrics (including
> something based on the data just sent, which approximates SIOCOUTQ),
> they are not as good as the current buffer size.

Ok. Thinking about this, I guess the kernel has had quite some time to
tweak its heuristics here, so making indirect use of that experience
would be a good idea.

-- 
David Gibson (he or they)               | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au          | minimalist, thank you, not the other way
                                        | around.
http://www.ozlabs.org/~dgibson
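[Editorial aside on the heuristic discussed in the thread above: the sketch
below shows, in isolation, how a buffer-to-BDP comparison with the 1/20 (5%)
ratio from SNDBUF_TO_BW_DELAY_INTERACTIVE in the quoted diff could look. It
is not the code from the patch: the function name, parameters, choice of RTT
value and unit handling are illustrative assumptions, based on TCP_INFO's
usual units (bytes per second for tcpi_delivery_rate, microseconds for RTT).]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SNDBUF_TO_BW_DELAY_INTERACTIVE	20	/* BDP below 5% of buffer */

/* Hypothetical helper (not from the patch): return true if the flow looks
 * bandwidth-bound and we should acknowledge everything, false if it looks
 * interactive and we should only acknowledge data as the peer does.
 */
bool ack_everything(uint64_t delivery_rate,	/* tcpi_delivery_rate, bytes/s */
		    uint32_t rtt_us,		/* RTT, microseconds */
		    uint32_t sndbuf)		/* sending buffer size, bytes */
{
	/* Bandwidth-delay product: bytes that can be in flight at the rate
	 * data is currently being delivered to the peer.
	 */
	uint64_t bdp = delivery_rate * rtt_us / 1000000;

	if (!bdp)	/* No delivery rate sample yet: assume bulk traffic */
		return true;

	/* A send buffer more than 20 times the BDP means the BDP is below
	 * 5% of the buffer: treat the flow as interactive.
	 */
	return sndbuf / bdp < SNDBUF_TO_BW_DELAY_INTERACTIVE;
}

int main(void)
{
	/* Example: 1 MiB/s delivery rate and 20 ms RTT give a BDP of about
	 * 20 KiB; against a 2.5 MiB send buffer that's well below 5%, so
	 * this prints 0 (interactive: don't acknowledge everything).
	 */
	printf("%d\n", ack_everything(1048576, 20000, 2621440));
	return 0;
}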