public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Jon Maloy <jmaloy@redhat.com>,
	Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data
Date: Wed, 10 Sep 2025 08:37:54 +0200	[thread overview]
Message-ID: <20250910083754.323c0b1e@elisabeth> (raw)
In-Reply-To: <aMDinHE5VYOiSEUJ@zatzit>

On Wed, 10 Sep 2025 12:29:48 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Sep 09, 2025 at 08:16:55PM +0200, Stefano Brivio wrote:
> > For some reason, tcp_vu_data_from_sock() already takes care of this,
> > but the non-vhost-user version ignores this possibility and just sends
> > out a FIN segment whenever we infer we received one host-side,
> > regardless of the fact that we might have unacknowledged data still to
> > send.
> > 
> > Somewhat surprisingly, this didn't cause any issue to be reported yet,
> > until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> > came around, leading to the following report from Paul, who hit this
> > running Podman tests:
> > 
> >   439   0.033032  169.254.1.2 → 192.168.122.100 65540 TCP 56602 → 5789 [PSH, ACK] Seq=10336361 Ack=1 Win=65536 Len=65480
> >   440   0.033055  169.254.1.2 → 192.168.122.100 30324 TCP [TCP Window Full] 56602 → 5789 [PSH, ACK] Seq=10401841 Ack=1 Win=65536 Len=30264
> > 
> > we're sending data to the container, up to the edge of the window
> > 
> >   441   0.033059 192.168.122.100 → 169.254.1.2  60 TCP 5789 → 56602 [ACK] Seq=1 Ack=10401841 Win=83968 Len=0
> > 
> > and the container acknowledges it
> > 
> >   442   0.033091  169.254.1.2 → 192.168.122.100 53716 TCP 56602 → 5789 [PSH, ACK] Seq=10432105 Ack=1 Win=65536 Len=53656
> > 
> > we send more data, all we possibly can, in window
> > 
> >   443   0.033126 192.168.122.100 → 169.254.1.2  60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
> > 
> > and the container shrinks the window due to the issue introduced
> > by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no
> > memory"). With a previous patch from this series, we rewind the
> > sequence, meaning that we assign conn->seq_to_tap from
> > conn->seq_ack_from_tap, so that we'll retransmit this segment, by
> > reading again from the socket, and increasing conn->seq_to_tap
> > once more.
> > 
> > However:
> > 
> >   444   0.033144  169.254.1.2 → 192.168.122.100 60 TCP 56602 → 5789 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0
> > 
> > we eventually get a zero-length read from the socket and we miss the
> > fact that conn->seq_to_tap != conn->seq_ack_from_tap, so we send a
> > FIN flag with the most recent sequence. The kernel insists:
> > 
> >   445   0.033147 192.168.122.100 → 169.254.1.2  60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
> > 
> > with its buggy zero-window update, but:
> > 
> >   446   0.033152 192.168.122.100 → 169.254.1.2  60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=69120 Len=0
> >   447   0.033202 192.168.122.100 → 169.254.1.2  60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=142848 Len=0
> > 
> > we don't reset the TAP_FIN_SENT flag anymore, and don't resend
> > the FIN segment (nor data), as we already rewound the sequence
> > earlier.
> > 
> > To solve this, hold off the FIN segment until we get a zero-length
> > read from the socket *and* we know that there's no unacknowledged
> > pending data, also without vhost-user, in tcp_buf_data_from_sock().
> > 
> > Reported-by: Paul Holzinger <pholzing@redhat.com>
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> At least, I think this makes sense.  As always, I find the semantics
> of the STALLED flag difficult to wrap my head around.

After this fix I start thinking that we should probably split STALLED
into several flags explicitly indicating what we are waiting for,
because it's rather complicated to make sure that some code path will
eventually take care of doing something on a later trigger/condition
(in this case, sending the FIN flag when we can).

I don't have a concrete list of possible flags in mind but something
like WAITING_ON_(x), or consistent sets of BLOCKED_BY_(x) (we can't
proceed with anything otherwise) and EXPECTING_(x) (x is expected but
we can proceed with something else).

Maybe we'll need to go the full (y)_BLOCKED_BY_(x) route, though
(FIN_BLOCKED_BY_TAP_ACK?). I hope not.

-- 
Stefano


  reply	other threads:[~2025-09-10  6:38 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-09 18:16 [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 1/8] tcp: FIN flags have to be retransmitted as well Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 2/8] tcp: Factor sequence rewind for retransmissions into a new function Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 3/8] tcp: Rewind sequence when guest shrinks window to zero Stefano Brivio
2025-09-10  2:20   ` David Gibson
2025-09-10  6:37     ` Stefano Brivio
2025-09-10  7:18       ` David Gibson
2025-09-10 23:48   ` Jon Maloy
2025-09-09 18:16 ` [PATCH v4 4/8] tcp: Fix closing logic for half-closed connections Stefano Brivio
2025-09-10 23:56   ` Jon Maloy
2025-09-09 18:16 ` [PATCH v4 5/8] tcp: Don't try to transmit right after the peer shrank the window to zero Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 6/8] tcp: Cast operands of sequence comparison macros to uint32_t before using them Stefano Brivio
2025-09-10  2:21   ` David Gibson
2025-09-09 18:16 ` [PATCH v4 7/8] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent Stefano Brivio
2025-09-10  2:27   ` David Gibson
2025-09-10  9:57     ` Stefano Brivio
2025-09-09 18:16 ` [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data Stefano Brivio
2025-09-10  2:29   ` David Gibson
2025-09-10  6:37     ` Stefano Brivio [this message]
2025-09-10  9:10 ` [PATCH v4 0/8] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels Paul Holzinger

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250910083754.323c0b1e@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=jmaloy@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=pholzing@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).