public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer
Date: Fri, 24 Mar 2023 20:20:24 +1100	[thread overview]
Message-ID: <ZB1rWGKz8gYrJnJW@yekko> (raw)
In-Reply-To: <20230324084259.431f9245@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 11269 bytes --]

On Fri, Mar 24, 2023 at 08:42:59AM +0100, Stefano Brivio wrote:
> On Fri, 24 Mar 2023 16:18:11 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Mar 23, 2023 at 05:08:31PM +0100, Stefano Brivio wrote:
> > 
> > I'm pretty sure this will make things better than they were, so in
> > that sense:
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > However, I'm not entirely convinced by some of the reasoning below.
> > 
> > > Since commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as
> > > needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we
> > > process an ACK segment, but, more correctly, only if we're really not
> > > waiting for a further ACK segment, that is, only if the acknowledged
> > > sequence matches what we sent.
> > > 
> > > In the new function implementing this, tcp_update_seqack_from_tap(),
> > > we also reset the retransmission counter and store the updated ACK
> > > sequence. Both should be done iff forward progress is acknowledged,
> > > implied by the fact that the new ACK sequence is greater than the
> > > one we previously stored.
> > > 
> > > At that point, it looked natural to also include the statements that
> > > clear and set the ACK_FROM_TAP_DUE flag inside the same conditional
> > > block: if we're not making forward progress, the need for an ACK, or
> > > lack thereof, should remain unchanged.
> > > 
> > > There's a case where this isn't true, though: if a client initiates  
> > 
> > Maybe mention this is a tap-side client specifically.
> > 
> > > a connection, and the server doesn't send data, with the client also
> > > not sending any further data except for what's possibly sent along
> > > with the first ACK segment following SYN, ACK from the server, we'll  
> > 
> > This doesn't seem right.  In my case the client *is* immediately
> > sending data.  It's *not* sending any in the ACK from the handshake
> > (is that even allowed?).
> 
> Yes, and I see that sometimes with iperf3:

Huh, interesting.  I'm not that astonished that it's allowed.  I'm
moderately suprised something's doing it in practice, I'm *really*
surprised it's not doing so non-deterministically.

> $ tshark -r iperf_01.pcap 
>     1   0.000000 10.131.1.134 → 10.128.2.170 58 TCP 52378 → 5201 [SYN] Seq=0 Win=14600 Len=0 MSS=1398
>     2   0.000064 10.128.2.170 → 10.131.1.134 58 TCP 5201 → 52378 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65480
>     3   0.000152 10.131.1.134 → 10.128.2.170 91 TCP 52378 → 5201 [ACK] Seq=1 Ack=1 Win=14600 Len=37
> 
> the client could even send data with the SYN segment, but in that case it
> shouldn't be delivered right away to the application, see RFC 793,
> section 3.4:
> 
>   Several examples of connection initiation follow.  Although these
>   examples do not show connection synchronization using data-carrying
>   segments, this is perfectly legitimate, so long as the receiving TCP
>   doesn't deliver the data to the user until it is clear the data is
>   valid (i.e., the data must be buffered at the receiver until the
>   connection reaches the ESTABLISHED state).
> 
> This never happens in practice, and as far as I can tell the kernel
> doesn't support this (see tcp_rcv_state_process() and
> tcp_conn_request(), they won't buffer that data), so passt will also
> discard this data, which needs to be re-sent (we only run on Linux,
> and we maintain the same behaviour as if passt weren't there).

Right, I saw that in the RFC, but it was pretty sure it was a thing
that nobody's really done for a long time.  Those old RFCs do
sometimes have stuff that seemed like a neat idea at the time, but
turned out not to be wise.  I did wonder if there might be a later RFC
banning or at least deprecating that practice.

> But data on the third segment can actually happen, and arguably we
> could say that the connection is ESTABLISHED at that point (there's
> no need for well-defined sequence points or suchlike).

Right, I was pretty sure that data on the SYN packets wasn't a thing
in practice, but I wasn't sure about the third packet.

> In that sense, the (p->count == 1) check in tcp_tap_handler() on
> TAP_SYN_RCVD isn't quite correct, and the client would need to
> retransmit the data with the current implementation, which is not
> ideal and could be improved. What's critical is that we don't reset
> the connection.
> 
> > AFAICT it's just the fact that the (socket side) server doesn't send
> > data which is relevant.
> 
> Maybe yes, I just couldn't reproduce this reliably with iperf3, and
> the difference between working and failing cases seemed to be that. On
> the other hand, your reproducer below shows this is not the case, so
> fine, this is irrelevant.
> 
> > > never, in the established state of the connection, call
> > > tcp_update_seqack_from_tap() with reported forward progress.
> > > 
> > > In that case, we'll never reset the initial ACK_FROM_TAP_DUE (used
> > > to trigger handshake timeouts), interpret the trigger as a the need
> > > for a retransmission (which won't actually retransmit anything), and
> > > eventually time out a perfectly healthy connection once we reach the
> > > maximum retransmission count.
> > > 
> > > This is relatively simple to reproduce if we switch back to 30s
> > > iperf3 test runs, but it depends on the timing of the iperf3 client:
> > > sometimes the first ACK from the client (part of the three-way
> > > handshake) will come with data (and we'll hit the problem), sometimes
> > > data will be sent later (and we call to tcp_update_seqack_from_tap()
> > > from tcp_data_from_tap() later, avoiding the issue).  
> > 
> > This last bit seems wrong too.  What I'm seeing is that
> > tcp_update_seqack_from_tap() *is* called later from
> > tcp_data_from_tap(), but it doesn't avoid the issue, because the
> > from-client ack number hasn't advanced, so it doesn't do anything.
> 
> Right, same as above.
> 
> > > A reliable reproducer is a simpler:
> > > 
> > >   $ strace -e accept,shutdown socat TCP-LISTEN:1111 STDIO &
> > >   [2] 2202832
> > >   $ pasta --config-net -- sh -c '(sleep 30; echo a) | socat STDIN TCP:88.198.0.161:1111'  
> > 
> > I assume 88.198.0.161 is the gateway address here?
> 
> Yes, see next paragraph of the commit message.
> 
> > >   accept(5, {sa_family=AF_INET, sin_port=htons(57200), sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
> > >   shutdown(6, SHUT_RDWR)                  = 0
> > >   --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
> > >   --- stopped by SIGTTOU ---
> > >   2023/03/23 16:05:06 socat[3] E write(5, 0x5645dbca9000, 2): Connection reset by peer  
> > 
> > That reproduces a problem, but not exactly the one I'm seeing (see
> > notes above).  Mine can be reproduced with:
> > 
> > $ strace -e accept,shutdown socat -u TCP-LISTEN:1111 OPEN:/dev/null,wronly &
> > $ ./pasta --config-net -- socat -u OPEN:/dev/zero,rdonly TCP:192.168.17.1:1111
> > 
> > then waiting 30s.
> 
> Okay, great, I just tried, this is fixed as well.
> 
> > > where the socat client connects, and no data exchange happens for 30s
> > > in either direction. Here, I'm using the default gateway address to
> > > reach the socat server on the host.
> > > 
> > > Fix this by clearing and setting the ACK_FROM_TAP_DUE flag regardless
> > > of reported forward progress. If we clear it when it's already unset,
> > > conn_flag() will do nothing with it. If it was set, it's always fine
> > > to reschedule the timer (as long as we're waiting for a further ACK),
> > > because we just received an ACK segment, no matter its sequence.  
> > 
> > Hrm... is that actually true?  Consider this situation: the server
> > (socket side) sent some data that got lost, so the client (tap side)
> > is not ever going to ack it.  However, the client is continuing to
> > send data, so we keep getting acks from it that don't make forward
> > progress.  Won't the change as proposed mean we then keep delaying the
> > retransmit indefinitely?
> 
> Ah, yes, right, I wanted to keep it simple but in this case we can't
> reschedule, I'm sending a v2 now.
> 
> > I've been thinking about this a bunch, and it's doing my head in a
> > bit.  However, I think the problem is that we have ACK_FROM_TAP_DUE
> > set in the first place, when we don't actually have any outstanding
> > sent data to ack.
> 
> That's not just for data, it's for any ACK, and timer_handler() also
> uses that flag to cover handshake timeouts.
> 
> > I think that's happening because we're not clearing
> > it on the very first ACK from the client  - the one in the handshake.
> 
> Right, and with this change we'll clear that.

We will, but with the side effect noted above.

> > That's because of the + 1 in:
> > 	tcp_seq_init(c, conn, now);
> > 	conn->seq_ack_from_tap = conn->seq_to_tap + 1;
> > 
> > I think we have a subtle conflict of two reasonable seeming invariants
> > here (written for the client on tap side case below, but there are
> > variants for other cases, I think).
> > 
> > A) We expect an ack iff seq_to_tap > seq_ack_from_tap
> > 
> >    According to this invariant, we want to remove the + 1 there.  We
> >    advance seq_to_tap when we send the syn-ack, and ACK_FROM_TAP_DUE
> >    is set.  seq_ack_from_tap catches up when we receive the handshake
> >    ack and ACK_FROM_TAP_DUE is cleared.
> > 
> > B) (seq_to_tap - seq_ack_from_tap) is equal to the number of bytes in
> >    the socket buffer which have been sent to the client at least once
> > 
> >    This wants the + 1, because before the server sends data there's
> >    obviously nothing in the socket buffers, including during the
> >    handshake.
> > 
> > But... I think the only place that relies on (B) is
> > tcp_data_from_sock(), and I don't think it ever gets called with
> > !ESTABLISHED.
> 
> No, it doesn't.
> 
> > So.. I think we want to stick with invariant (A) and
> > remove the "+ 1", for both the conn_from_sock and conn_from_tap
> > variants.
> 
> In both places, the "+ 1" represents the SYN segment, which "counts as
> one". RFC 793, section 3.1:
> 
>   If SYN is present the sequence number is the
>   initial sequence number (ISN) and the first data octet is ISN+1.

I realize it represents the SYN, but that's not the point.
seq_ack_from_tap isn't the ack we *expect*, it's the one we've already
gotten.  By putting the + 1 here, we're treating the client as having
acked the syn (or syn-ack) before it actually did.

> However, while this was mentally convenient for me, I see how it
> violates your expectation at A), so both "+ 1" could be replaced by
> comments, really.

> Right now I just want to fix this regression fast in a non-invasive
> way (in the real world, it probably only happens with "long" iperf3
> tests, but that's, well, a common use case for passt). But then feel
> free to patch that to fit A) above.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

      reply	other threads:[~2023-03-24  9:21 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-23 16:08 [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer Stefano Brivio
2023-03-24  5:18 ` David Gibson
2023-03-24  7:42   ` Stefano Brivio
2023-03-24  9:20     ` David Gibson [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZB1rWGKz8gYrJnJW@yekko \
    --to=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).