On Fri, Sep 29, 2023 at 05:04:45PM +0200, Stefano Brivio wrote: > It looks like we need it as workaround for this situation, readily > reproducible at least with a 6.5 Linux kernel, with default rmem_max > and wmem_max values: > > - an iperf3 client on the host sends about 160 KiB, typically > segmented into five frames by passt. We read this data using > MSG_PEEK > > - the iperf3 server on the guest starts receiving > > - meanwhile, the host kernel advertised a zero-sized window to the > receiver, as expected > > - eventually, the guest acknowledges all the data sent so far, and > we drop it from the buffer, courtesy of tcp_sock_consume(), using > recv() with MSG_TRUNC > > - the client, however, doesn't get an updated window value, and > even keepalive packets are answered with zero-window segments, > until the connection is closed > > It looks like dropping data from a socket using MSG_TRUNC doesn't > cause a recalculation of the window, which would be expected as a > result of any receiving operation that invalidates data on a buffer > (that is, not with MSG_PEEK). > > Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to > the previous value we clamped to, forces a recalculation of the > window which is advertised to the guest. > > I couldn't quite confirm this issue by following all the possible > code paths in the kernel, yet. If confirmed, this should be fixed in > the kernel, but meanwhile this workaround looks robust to me (and it > will be needed for backward compatibility anyway). > > Reported-by: Matej Hrica > Link: https://bugs.passt.top/show_bug.cgi?id=74 > Analysed-by: David Gibson > Signed-off-by: Stefano Brivio Reviewed-by: David Gibson > --- > tcp.c | 29 ++++++++++++++++++++++++----- > 1 file changed, 24 insertions(+), 5 deletions(-) > > diff --git a/tcp.c b/tcp.c > index dff5e79..32917c8 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn, > wnd <<= conn->ws_from_tap; > wnd = MIN(MAX_WINDOW, wnd); > > - if (conn->flags & WND_CLAMPED) { > + /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up > + * with a zero-sized window on a TCP socket, dropping data (once > + * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear > + * to be enough to make the kernel advertise a non-zero window to the > + * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing > + * value, fixes this. > + * > + * The STALLED flag on a connection is a sufficient indication that we > + * might have a zero-sized window on the socket, because it's set if we > + * exhausted the tap-side window, or if everything we receive from a > + * socket is already in flight to the guest. > + * > + * So, if STALLED is set, and we received a window value from the tap, > + * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated > + * further and fixed in the kernel instead, if confirmed. > + */ > + if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) { > if (prev_scaled == wnd) > return; > > @@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, > i = keep - 1; > } > > - tcp_clamp_window(c, conn, max_ack_seq_wnd); > - > /* On socket flush failure, pretend there was no ACK, try again later */ > if (ack && !tcp_sock_consume(conn, max_ack_seq)) > tcp_update_seqack_from_tap(c, conn, max_ack_seq); > > + tcp_clamp_window(c, conn, max_ack_seq_wnd); > + > if (retr) { > trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u", > max_ack_seq, conn->seq_to_tap); > @@ -2572,8 +2588,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, > if (th->ack && !(conn->events & ESTABLISHED)) > tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); > > - conn_flag(c, conn, ~STALLED); > - > /* Establishing connection from socket */ > if (conn->events & SOCK_ACCEPTED) { > if (th->syn && th->ack && !th->fin) { > @@ -2628,6 +2642,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, > if (count == -1) > goto reset; > > + /* Note: STALLED matters for tcp_clamp_window(): unset it only after > + * processing data (and window) from the tap side > + */ > + conn_flag(c, conn, ~STALLED); > + > if (conn->seq_ack_to_tap != conn->seq_from_tap) > ack_due = 1; > -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson