On Tue, Oct 07, 2025 at 12:32:54AM +0200, Stefano Brivio wrote: > On Fri, 3 Oct 2025 13:43:32 +1000 > David Gibson wrote: > > > On Thu, Oct 02, 2025 at 01:51:04PM +0200, Stefano Brivio wrote: > > > On Thu, 2 Oct 2025 13:02:09 +1000 > > > David Gibson wrote: > > > > On Thu, Oct 02, 2025 at 12:52:31PM +1000, David Gibson wrote: > > > > > On Thu, Oct 02, 2025 at 02:06:45AM +0200, Stefano Brivio wrote: > > [snip] > > > > > > diff --git a/tcp.c b/tcp.c > > > > > > index 3f7dc82..5a7a607 100644 > > > > > > --- a/tcp.c > > > > > > +++ b/tcp.c > > > > > > @@ -1769,7 +1769,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > > > > > } > > > > > > } > > > > > > > > > > > > - if (th->fin) > > > > > > + if (th->fin && seq == seq_from_tap) > > > > > > fin = 1; > > > > > > > > > > Can a FIN segment also contain data? My quick googling suggests yes. > > > > > > Yes, absolutely, my slow wiresharking over the years also confirms, and > > > it's so often the case that (I think) this issue doesn't happen so > > > frequently as it only occurs if we have a FIN segment without data. > > > > Makes sense. > > > > > If we have a data segment, with FIN set, that we can't fully transmit, > > > we already set 'partial_send' and won't set TAP_FIN_RCVD as a result. > > > > > > Another case where we want to ignore a FIN segment with data is if we > > > have a gap before it, but in that case we'll eventually set 'keep' and > > > return early. > > > > Ah, right. I'd noticed we set fin = 1 in that case, but forgotten > > about the exit before setting TAP_FIN_RCVD if keep is set. > > > > > > > If so, doesn't this logic need to go after we process the data > > > > > processing, so that seq_from_tap points to the end of the packet's > > > > > data, rather than the beginning? (And the handling of zero-length > > > > > packets would also need revision to match). > > > > > > This made sense to me for a moment but now I'm struggling to understand > > > or remember why. What I want to check here is that a FIN segment > > > without data (I should have specified in the commit message) is > > > acceptable because its sequence is as expected. > > > > Right. This is correct for zero-data FIN segments, but I think as a > > side-effect you've made it ignore certain FIN segments _with_ data. > > It will work in the common case where the data exactly follows on from > > what we already have. But in the case where the segment has some data > > we already have and some new data, the fin = 1 won't trip because seq > > != seq_from_tap. There isn't another place that will catch it > > instead, AFAICT. > > > > I guess it will be fine in the end, because with all the data acked, > > the guest should retransmit the FIN with no data. > > > > > But going back to FIN segments with data: why should we sum the length > > > to seq_from_tap before comparing the sequence? I don't understand what > > > additional check you want to introduce, or what case you want to cover. > > > > I was thinking about the case above, but I didn't explain it very > > well. > > > > > > Following on from that, it seems to me like it would make sense for > > > > FIN segments to also participate in the 'keep' mechanism. It should > > > > work eventually, but I expect it would be smoother in the case that we > > > > get a final burst of packets in a stream out of order. > > > > > > FIN segments with data already go through that dance. > > > > > > Without data, I guess you're right, we might have in the same batch > > > (not that I've ever seen it happening in practice) a FIN segment > > > without data that we process first (and now discard because of the > > > sequence number), and some data before that we process later, so we > > > shouldn't throw away the FIN segment because of that. We should, > > > conceptually, reorder it as well. > > > > > > It probably makes things more complicated for a reason that's not so > > > critical (ignoring a FIN is fine, we'll get another one), and I wanted > > > to have the simplest possible fix here. > > > > > > Let me see if I can make this entirely correct without a substantially > > > bigger change, I haven't really tried. > > > > How about this: > > > > diff --git a/tcp.c b/tcp.c > > index 7da41797..42e576b4 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -1774,10 +1774,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > } > > } > > > > - if (th->fin) > > - fin = 1; > > - > > - if (!len) > > + if (!len && !th->fin) > > continue; > > > > seq_offset = seq_from_tap - seq; > > @@ -1820,6 +1817,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, > > break; > > seq_from_tap += size; > > iov_i += count; > > + if (th->fin) > > + fin = 1; > > > > if (keep == i) > > keep = -1; > > > > > > We'd need to double check that the "accept data segment" path is safe > > with len == 0, of course. > > For sure it's not before d2c33f45f7be ("tcp: Convert > tcp_data_from_tap() to use iov_tail"), because we might add > zero-length segments to the tcp_iov array, and that would make > backporting an otherwise simple and critical fix to slightly older > versions rather complicated. Kinda. It's not that complicated to deal with that case, by wrapping the actual data processing in an `if (len) { ... }` > After that commit, I'm not sure about the behaviour of > iov_tail_clone(). I think it will return 0, but it should be tested > (that assumption, itself, but also that the fix still works). Right, I think it's safe in that case - that's what I was looking at. > > But I think that will treat dataless and > > with-data FINs the same way, and let them use the keep mechanism. > > Given that the only advantage of doing this would be to handle a rare > (I guess) corner case, that is, an out-of-sequence FIN segment with > data, which is not critical anyway because the FIN will be > retransmitted, I would rather keep this (critical) fix as it is. > > I would suggest filing a separate ticket or anyway sending a separate > patch if you want to fix that other case. Fair enough. I'll wait for you to get the basic fix merge, then I might batch this cleanup/fixup with the reorg to clarify bytes_acked handling. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson