public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer
@ 2023-03-23 16:08 Stefano Brivio
  2023-03-24  5:18 ` David Gibson
  0 siblings, 1 reply; 4+ messages in thread
From: Stefano Brivio @ 2023-03-23 16:08 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

Since commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as
needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we
process an ACK segment, but, more correctly, only if we're really not
waiting for a further ACK segment, that is, only if the acknowledged
sequence matches what we sent.

In the new function implementing this, tcp_update_seqack_from_tap(),
we also reset the retransmission counter and store the updated ACK
sequence. Both should be done iff forward progress is acknowledged,
implied by the fact that the new ACK sequence is greater than the
one we previously stored.

At that point, it looked natural to also include the statements that
clear and set the ACK_FROM_TAP_DUE flag inside the same conditional
block: if we're not making forward progress, the need for an ACK, or
lack thereof, should remain unchanged.

There's a case where this isn't true, though: if a client initiates
a connection, and the server doesn't send data, with the client also
not sending any further data except for what's possibly sent along
with the first ACK segment following SYN, ACK from the server, we'll
never, in the established state of the connection, call
tcp_update_seqack_from_tap() with reported forward progress.

In that case, we'll never reset the initial ACK_FROM_TAP_DUE (used
to trigger handshake timeouts), interpret the trigger as a the need
for a retransmission (which won't actually retransmit anything), and
eventually time out a perfectly healthy connection once we reach the
maximum retransmission count.

This is relatively simple to reproduce if we switch back to 30s
iperf3 test runs, but it depends on the timing of the iperf3 client:
sometimes the first ACK from the client (part of the three-way
handshake) will come with data (and we'll hit the problem), sometimes
data will be sent later (and we call to tcp_update_seqack_from_tap()
from tcp_data_from_tap() later, avoiding the issue).

A reliable reproducer is a simpler:

  $ strace -e accept,shutdown socat TCP-LISTEN:1111 STDIO &
  [2] 2202832
  $ pasta --config-net -- sh -c '(sleep 30; echo a) | socat STDIN TCP:88.198.0.161:1111'
  accept(5, {sa_family=AF_INET, sin_port=htons(57200), sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
  shutdown(6, SHUT_RDWR)                  = 0
  --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
  --- stopped by SIGTTOU ---
  2023/03/23 16:05:06 socat[3] E write(5, 0x5645dbca9000, 2): Connection reset by peer

where the socat client connects, and no data exchange happens for 30s
in either direction. Here, I'm using the default gateway address to
reach the socat server on the host.

Fix this by clearing and setting the ACK_FROM_TAP_DUE flag regardless
of reported forward progress. If we clear it when it's already unset,
conn_flag() will do nothing with it. If it was set, it's always fine
to reschedule the timer (as long as we're waiting for a further ACK),
because we just received an ACK segment, no matter its sequence.

Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tcp.c b/tcp.c
index f156287..b225bbe 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1610,12 +1610,12 @@ out:
 static void tcp_update_seqack_from_tap(const struct ctx *c,
 				       struct tcp_tap_conn *conn, uint32_t seq)
 {
-	if (SEQ_GT(seq, conn->seq_ack_from_tap)) {
-		if (seq == conn->seq_to_tap)
-			conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
-		else
-			conn_flag(c, conn, ACK_FROM_TAP_DUE);
+	if (seq == conn->seq_to_tap)
+		conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
+	else
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
+	if (SEQ_GT(seq, conn->seq_ack_from_tap)) {
 		conn->retrans = 0;
 		conn->seq_ack_from_tap = seq;
 	}
-- 
@@ -1610,12 +1610,12 @@ out:
 static void tcp_update_seqack_from_tap(const struct ctx *c,
 				       struct tcp_tap_conn *conn, uint32_t seq)
 {
-	if (SEQ_GT(seq, conn->seq_ack_from_tap)) {
-		if (seq == conn->seq_to_tap)
-			conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
-		else
-			conn_flag(c, conn, ACK_FROM_TAP_DUE);
+	if (seq == conn->seq_to_tap)
+		conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
+	else
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
+	if (SEQ_GT(seq, conn->seq_ack_from_tap)) {
 		conn->retrans = 0;
 		conn->seq_ack_from_tap = seq;
 	}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer
  2023-03-23 16:08 [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer Stefano Brivio
@ 2023-03-24  5:18 ` David Gibson
  2023-03-24  7:42   ` Stefano Brivio
  0 siblings, 1 reply; 4+ messages in thread
From: David Gibson @ 2023-03-24  5:18 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 7467 bytes --]

On Thu, Mar 23, 2023 at 05:08:31PM +0100, Stefano Brivio wrote:

I'm pretty sure this will make things better than they were, so in
that sense:

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

However, I'm not entirely convinced by some of the reasoning below.

> Since commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as
> needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we
> process an ACK segment, but, more correctly, only if we're really not
> waiting for a further ACK segment, that is, only if the acknowledged
> sequence matches what we sent.
> 
> In the new function implementing this, tcp_update_seqack_from_tap(),
> we also reset the retransmission counter and store the updated ACK
> sequence. Both should be done iff forward progress is acknowledged,
> implied by the fact that the new ACK sequence is greater than the
> one we previously stored.
> 
> At that point, it looked natural to also include the statements that
> clear and set the ACK_FROM_TAP_DUE flag inside the same conditional
> block: if we're not making forward progress, the need for an ACK, or
> lack thereof, should remain unchanged.
> 
> There's a case where this isn't true, though: if a client initiates

Maybe mention this is a tap-side client specifically.

> a connection, and the server doesn't send data, with the client also
> not sending any further data except for what's possibly sent along
> with the first ACK segment following SYN, ACK from the server, we'll

This doesn't seem right.  In my case the client *is* immediately
sending data.  It's *not* sending any in the ACK from the handshake
(is that even allowed?).  AFAICT it's just the fact that the (socket
side) server doesn't send data which is relevant.

> never, in the established state of the connection, call
> tcp_update_seqack_from_tap() with reported forward progress.
> 
> In that case, we'll never reset the initial ACK_FROM_TAP_DUE (used
> to trigger handshake timeouts), interpret the trigger as a the need
> for a retransmission (which won't actually retransmit anything), and
> eventually time out a perfectly healthy connection once we reach the
> maximum retransmission count.
> 
> This is relatively simple to reproduce if we switch back to 30s
> iperf3 test runs, but it depends on the timing of the iperf3 client:
> sometimes the first ACK from the client (part of the three-way
> handshake) will come with data (and we'll hit the problem), sometimes
> data will be sent later (and we call to tcp_update_seqack_from_tap()
> from tcp_data_from_tap() later, avoiding the issue).

This last bit seems wrong too.  What I'm seeing is that
tcp_update_seqack_from_tap() *is* called later from
tcp_data_from_tap(), but it doesn't avoid the issue, because the
from-client ack number hasn't advanced, so it doesn't do anything.

> A reliable reproducer is a simpler:
> 
>   $ strace -e accept,shutdown socat TCP-LISTEN:1111 STDIO &
>   [2] 2202832
>   $ pasta --config-net -- sh -c '(sleep 30; echo a) | socat STDIN TCP:88.198.0.161:1111'

I assume 88.198.0.161 is the gateway address here?

>   accept(5, {sa_family=AF_INET, sin_port=htons(57200), sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
>   shutdown(6, SHUT_RDWR)                  = 0
>   --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
>   --- stopped by SIGTTOU ---
>   2023/03/23 16:05:06 socat[3] E write(5, 0x5645dbca9000, 2): Connection reset by peer

That reproduces a problem, but not exactly the one I'm seeing (see
notes above).  Mine can be reproduced with:

$ strace -e accept,shutdown socat -u TCP-LISTEN:1111 OPEN:/dev/null,wronly &
$ ./pasta --config-net -- socat -u OPEN:/dev/zero,rdonly TCP:192.168.17.1:1111

then waiting 30s.

> where the socat client connects, and no data exchange happens for 30s
> in either direction. Here, I'm using the default gateway address to
> reach the socat server on the host.
> 
> Fix this by clearing and setting the ACK_FROM_TAP_DUE flag regardless
> of reported forward progress. If we clear it when it's already unset,
> conn_flag() will do nothing with it. If it was set, it's always fine
> to reschedule the timer (as long as we're waiting for a further ACK),
> because we just received an ACK segment, no matter its sequence.

Hrm... is that actually true?  Consider this situation: the server
(socket side) sent some data that got lost, so the client (tap side)
is not ever going to ack it.  However, the client is continuing to
send data, so we keep getting acks from it that don't make forward
progress.  Won't the change as proposed mean we then keep delaying the
retransmit indefinitely?


I've been thinking about this a bunch, and it's doing my head in a
bit.  However, I think the problem is that we have ACK_FROM_TAP_DUE
set in the first place, when we don't actually have any outstanding
sent data to ack.  I think that's happening because we're not clearing
it on the very first ACK from the client  - the one in the handshake.

That's because of the + 1 in:
	tcp_seq_init(c, conn, now);
	conn->seq_ack_from_tap = conn->seq_to_tap + 1;

I think we have a subtle conflict of two reasonable seeming invariants
here (written for the client on tap side case below, but there are
variants for other cases, I think).

A) We expect an ack iff seq_to_tap > seq_ack_from_tap

   According to this invariant, we want to remove the + 1 there.  We
   advance seq_to_tap when we send the syn-ack, and ACK_FROM_TAP_DUE
   is set.  seq_ack_from_tap catches up when we receive the handshake
   ack and ACK_FROM_TAP_DUE is cleared.

B) (seq_to_tap - seq_ack_from_tap) is equal to the number of bytes in
   the socket buffer which have been sent to the client at least once

   This wants the + 1, because before the server sends data there's
   obviously nothing in the socket buffers, including during the
   handshake.

But... I think the only place that relies on (B) is
tcp_data_from_sock(), and I don't think it ever gets called with
!ESTABLISHED.  So.. I think we want to stick with invariant (A) and
remove the "+ 1", for both the conn_from_sock and conn_from_tap
variants.


> Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
> Reported-by: David Gibson <david@gibson.dropbear.id.au>
> Analysed-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index f156287..b225bbe 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1610,12 +1610,12 @@ out:
>  static void tcp_update_seqack_from_tap(const struct ctx *c,
>  				       struct tcp_tap_conn *conn, uint32_t seq)
>  {
> -	if (SEQ_GT(seq, conn->seq_ack_from_tap)) {
> -		if (seq == conn->seq_to_tap)
> -			conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
> -		else
> -			conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +	if (seq == conn->seq_to_tap)
> +		conn_flag(c, conn, ~ACK_FROM_TAP_DUE);
> +	else
> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  
> +	if (SEQ_GT(seq, conn->seq_ack_from_tap)) {
>  		conn->retrans = 0;
>  		conn->seq_ack_from_tap = seq;
>  	}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer
  2023-03-24  5:18 ` David Gibson
@ 2023-03-24  7:42   ` Stefano Brivio
  2023-03-24  9:20     ` David Gibson
  0 siblings, 1 reply; 4+ messages in thread
From: Stefano Brivio @ 2023-03-24  7:42 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 24 Mar 2023 16:18:11 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Mar 23, 2023 at 05:08:31PM +0100, Stefano Brivio wrote:
> 
> I'm pretty sure this will make things better than they were, so in
> that sense:
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> However, I'm not entirely convinced by some of the reasoning below.
> 
> > Since commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as
> > needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we
> > process an ACK segment, but, more correctly, only if we're really not
> > waiting for a further ACK segment, that is, only if the acknowledged
> > sequence matches what we sent.
> > 
> > In the new function implementing this, tcp_update_seqack_from_tap(),
> > we also reset the retransmission counter and store the updated ACK
> > sequence. Both should be done iff forward progress is acknowledged,
> > implied by the fact that the new ACK sequence is greater than the
> > one we previously stored.
> > 
> > At that point, it looked natural to also include the statements that
> > clear and set the ACK_FROM_TAP_DUE flag inside the same conditional
> > block: if we're not making forward progress, the need for an ACK, or
> > lack thereof, should remain unchanged.
> > 
> > There's a case where this isn't true, though: if a client initiates  
> 
> Maybe mention this is a tap-side client specifically.
> 
> > a connection, and the server doesn't send data, with the client also
> > not sending any further data except for what's possibly sent along
> > with the first ACK segment following SYN, ACK from the server, we'll  
> 
> This doesn't seem right.  In my case the client *is* immediately
> sending data.  It's *not* sending any in the ACK from the handshake
> (is that even allowed?).

Yes, and I see that sometimes with iperf3:

$ tshark -r iperf_01.pcap 
    1   0.000000 10.131.1.134 → 10.128.2.170 58 TCP 52378 → 5201 [SYN] Seq=0 Win=14600 Len=0 MSS=1398
    2   0.000064 10.128.2.170 → 10.131.1.134 58 TCP 5201 → 52378 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65480
    3   0.000152 10.131.1.134 → 10.128.2.170 91 TCP 52378 → 5201 [ACK] Seq=1 Ack=1 Win=14600 Len=37

the client could even send data with the SYN segment, but in that case it
shouldn't be delivered right away to the application, see RFC 793,
section 3.4:

  Several examples of connection initiation follow.  Although these
  examples do not show connection synchronization using data-carrying
  segments, this is perfectly legitimate, so long as the receiving TCP
  doesn't deliver the data to the user until it is clear the data is
  valid (i.e., the data must be buffered at the receiver until the
  connection reaches the ESTABLISHED state).

This never happens in practice, and as far as I can tell the kernel
doesn't support this (see tcp_rcv_state_process() and
tcp_conn_request(), they won't buffer that data), so passt will also
discard this data, which needs to be re-sent (we only run on Linux,
and we maintain the same behaviour as if passt weren't there).

But data on the third segment can actually happen, and arguably we
could say that the connection is ESTABLISHED at that point (there's
no need for well-defined sequence points or suchlike).

In that sense, the (p->count == 1) check in tcp_tap_handler() on
TAP_SYN_RCVD isn't quite correct, and the client would need to
retransmit the data with the current implementation, which is not
ideal and could be improved. What's critical is that we don't reset
the connection.

> AFAICT it's just the fact that the (socket side) server doesn't send
> data which is relevant.

Maybe yes, I just couldn't reproduce this reliably with iperf3, and
the difference between working and failing cases seemed to be that. On
the other hand, your reproducer below shows this is not the case, so
fine, this is irrelevant.

> > never, in the established state of the connection, call
> > tcp_update_seqack_from_tap() with reported forward progress.
> > 
> > In that case, we'll never reset the initial ACK_FROM_TAP_DUE (used
> > to trigger handshake timeouts), interpret the trigger as a the need
> > for a retransmission (which won't actually retransmit anything), and
> > eventually time out a perfectly healthy connection once we reach the
> > maximum retransmission count.
> > 
> > This is relatively simple to reproduce if we switch back to 30s
> > iperf3 test runs, but it depends on the timing of the iperf3 client:
> > sometimes the first ACK from the client (part of the three-way
> > handshake) will come with data (and we'll hit the problem), sometimes
> > data will be sent later (and we call to tcp_update_seqack_from_tap()
> > from tcp_data_from_tap() later, avoiding the issue).  
> 
> This last bit seems wrong too.  What I'm seeing is that
> tcp_update_seqack_from_tap() *is* called later from
> tcp_data_from_tap(), but it doesn't avoid the issue, because the
> from-client ack number hasn't advanced, so it doesn't do anything.

Right, same as above.

> > A reliable reproducer is a simpler:
> > 
> >   $ strace -e accept,shutdown socat TCP-LISTEN:1111 STDIO &
> >   [2] 2202832
> >   $ pasta --config-net -- sh -c '(sleep 30; echo a) | socat STDIN TCP:88.198.0.161:1111'  
> 
> I assume 88.198.0.161 is the gateway address here?

Yes, see next paragraph of the commit message.

> >   accept(5, {sa_family=AF_INET, sin_port=htons(57200), sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
> >   shutdown(6, SHUT_RDWR)                  = 0
> >   --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
> >   --- stopped by SIGTTOU ---
> >   2023/03/23 16:05:06 socat[3] E write(5, 0x5645dbca9000, 2): Connection reset by peer  
> 
> That reproduces a problem, but not exactly the one I'm seeing (see
> notes above).  Mine can be reproduced with:
> 
> $ strace -e accept,shutdown socat -u TCP-LISTEN:1111 OPEN:/dev/null,wronly &
> $ ./pasta --config-net -- socat -u OPEN:/dev/zero,rdonly TCP:192.168.17.1:1111
> 
> then waiting 30s.

Okay, great, I just tried, this is fixed as well.

> > where the socat client connects, and no data exchange happens for 30s
> > in either direction. Here, I'm using the default gateway address to
> > reach the socat server on the host.
> > 
> > Fix this by clearing and setting the ACK_FROM_TAP_DUE flag regardless
> > of reported forward progress. If we clear it when it's already unset,
> > conn_flag() will do nothing with it. If it was set, it's always fine
> > to reschedule the timer (as long as we're waiting for a further ACK),
> > because we just received an ACK segment, no matter its sequence.  
> 
> Hrm... is that actually true?  Consider this situation: the server
> (socket side) sent some data that got lost, so the client (tap side)
> is not ever going to ack it.  However, the client is continuing to
> send data, so we keep getting acks from it that don't make forward
> progress.  Won't the change as proposed mean we then keep delaying the
> retransmit indefinitely?

Ah, yes, right, I wanted to keep it simple but in this case we can't
reschedule, I'm sending a v2 now.

> I've been thinking about this a bunch, and it's doing my head in a
> bit.  However, I think the problem is that we have ACK_FROM_TAP_DUE
> set in the first place, when we don't actually have any outstanding
> sent data to ack.

That's not just for data, it's for any ACK, and timer_handler() also
uses that flag to cover handshake timeouts.

> I think that's happening because we're not clearing
> it on the very first ACK from the client  - the one in the handshake.

Right, and with this change we'll clear that.

> That's because of the + 1 in:
> 	tcp_seq_init(c, conn, now);
> 	conn->seq_ack_from_tap = conn->seq_to_tap + 1;
> 
> I think we have a subtle conflict of two reasonable seeming invariants
> here (written for the client on tap side case below, but there are
> variants for other cases, I think).
> 
> A) We expect an ack iff seq_to_tap > seq_ack_from_tap
> 
>    According to this invariant, we want to remove the + 1 there.  We
>    advance seq_to_tap when we send the syn-ack, and ACK_FROM_TAP_DUE
>    is set.  seq_ack_from_tap catches up when we receive the handshake
>    ack and ACK_FROM_TAP_DUE is cleared.
> 
> B) (seq_to_tap - seq_ack_from_tap) is equal to the number of bytes in
>    the socket buffer which have been sent to the client at least once
> 
>    This wants the + 1, because before the server sends data there's
>    obviously nothing in the socket buffers, including during the
>    handshake.
> 
> But... I think the only place that relies on (B) is
> tcp_data_from_sock(), and I don't think it ever gets called with
> !ESTABLISHED.

No, it doesn't.

> So.. I think we want to stick with invariant (A) and
> remove the "+ 1", for both the conn_from_sock and conn_from_tap
> variants.

In both places, the "+ 1" represents the SYN segment, which "counts as
one". RFC 793, section 3.1:

  If SYN is present the sequence number is the
  initial sequence number (ISN) and the first data octet is ISN+1.

However, while this was mentally convenient for me, I see how it
violates your expectation at A), so both "+ 1" could be replaced by
comments, really.

Right now I just want to fix this regression fast in a non-invasive
way (in the real world, it probably only happens with "long" iperf3
tests, but that's, well, a common use case for passt). But then feel
free to patch that to fit A) above.

-- 
Stefano


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer
  2023-03-24  7:42   ` Stefano Brivio
@ 2023-03-24  9:20     ` David Gibson
  0 siblings, 0 replies; 4+ messages in thread
From: David Gibson @ 2023-03-24  9:20 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 11269 bytes --]

On Fri, Mar 24, 2023 at 08:42:59AM +0100, Stefano Brivio wrote:
> On Fri, 24 Mar 2023 16:18:11 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Mar 23, 2023 at 05:08:31PM +0100, Stefano Brivio wrote:
> > 
> > I'm pretty sure this will make things better than they were, so in
> > that sense:
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > However, I'm not entirely convinced by some of the reasoning below.
> > 
> > > Since commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as
> > > needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we
> > > process an ACK segment, but, more correctly, only if we're really not
> > > waiting for a further ACK segment, that is, only if the acknowledged
> > > sequence matches what we sent.
> > > 
> > > In the new function implementing this, tcp_update_seqack_from_tap(),
> > > we also reset the retransmission counter and store the updated ACK
> > > sequence. Both should be done iff forward progress is acknowledged,
> > > implied by the fact that the new ACK sequence is greater than the
> > > one we previously stored.
> > > 
> > > At that point, it looked natural to also include the statements that
> > > clear and set the ACK_FROM_TAP_DUE flag inside the same conditional
> > > block: if we're not making forward progress, the need for an ACK, or
> > > lack thereof, should remain unchanged.
> > > 
> > > There's a case where this isn't true, though: if a client initiates  
> > 
> > Maybe mention this is a tap-side client specifically.
> > 
> > > a connection, and the server doesn't send data, with the client also
> > > not sending any further data except for what's possibly sent along
> > > with the first ACK segment following SYN, ACK from the server, we'll  
> > 
> > This doesn't seem right.  In my case the client *is* immediately
> > sending data.  It's *not* sending any in the ACK from the handshake
> > (is that even allowed?).
> 
> Yes, and I see that sometimes with iperf3:

Huh, interesting.  I'm not that astonished that it's allowed.  I'm
moderately suprised something's doing it in practice, I'm *really*
surprised it's not doing so non-deterministically.

> $ tshark -r iperf_01.pcap 
>     1   0.000000 10.131.1.134 → 10.128.2.170 58 TCP 52378 → 5201 [SYN] Seq=0 Win=14600 Len=0 MSS=1398
>     2   0.000064 10.128.2.170 → 10.131.1.134 58 TCP 5201 → 52378 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65480
>     3   0.000152 10.131.1.134 → 10.128.2.170 91 TCP 52378 → 5201 [ACK] Seq=1 Ack=1 Win=14600 Len=37
> 
> the client could even send data with the SYN segment, but in that case it
> shouldn't be delivered right away to the application, see RFC 793,
> section 3.4:
> 
>   Several examples of connection initiation follow.  Although these
>   examples do not show connection synchronization using data-carrying
>   segments, this is perfectly legitimate, so long as the receiving TCP
>   doesn't deliver the data to the user until it is clear the data is
>   valid (i.e., the data must be buffered at the receiver until the
>   connection reaches the ESTABLISHED state).
> 
> This never happens in practice, and as far as I can tell the kernel
> doesn't support this (see tcp_rcv_state_process() and
> tcp_conn_request(), they won't buffer that data), so passt will also
> discard this data, which needs to be re-sent (we only run on Linux,
> and we maintain the same behaviour as if passt weren't there).

Right, I saw that in the RFC, but it was pretty sure it was a thing
that nobody's really done for a long time.  Those old RFCs do
sometimes have stuff that seemed like a neat idea at the time, but
turned out not to be wise.  I did wonder if there might be a later RFC
banning or at least deprecating that practice.

> But data on the third segment can actually happen, and arguably we
> could say that the connection is ESTABLISHED at that point (there's
> no need for well-defined sequence points or suchlike).

Right, I was pretty sure that data on the SYN packets wasn't a thing
in practice, but I wasn't sure about the third packet.

> In that sense, the (p->count == 1) check in tcp_tap_handler() on
> TAP_SYN_RCVD isn't quite correct, and the client would need to
> retransmit the data with the current implementation, which is not
> ideal and could be improved. What's critical is that we don't reset
> the connection.
> 
> > AFAICT it's just the fact that the (socket side) server doesn't send
> > data which is relevant.
> 
> Maybe yes, I just couldn't reproduce this reliably with iperf3, and
> the difference between working and failing cases seemed to be that. On
> the other hand, your reproducer below shows this is not the case, so
> fine, this is irrelevant.
> 
> > > never, in the established state of the connection, call
> > > tcp_update_seqack_from_tap() with reported forward progress.
> > > 
> > > In that case, we'll never reset the initial ACK_FROM_TAP_DUE (used
> > > to trigger handshake timeouts), interpret the trigger as a the need
> > > for a retransmission (which won't actually retransmit anything), and
> > > eventually time out a perfectly healthy connection once we reach the
> > > maximum retransmission count.
> > > 
> > > This is relatively simple to reproduce if we switch back to 30s
> > > iperf3 test runs, but it depends on the timing of the iperf3 client:
> > > sometimes the first ACK from the client (part of the three-way
> > > handshake) will come with data (and we'll hit the problem), sometimes
> > > data will be sent later (and we call to tcp_update_seqack_from_tap()
> > > from tcp_data_from_tap() later, avoiding the issue).  
> > 
> > This last bit seems wrong too.  What I'm seeing is that
> > tcp_update_seqack_from_tap() *is* called later from
> > tcp_data_from_tap(), but it doesn't avoid the issue, because the
> > from-client ack number hasn't advanced, so it doesn't do anything.
> 
> Right, same as above.
> 
> > > A reliable reproducer is a simpler:
> > > 
> > >   $ strace -e accept,shutdown socat TCP-LISTEN:1111 STDIO &
> > >   [2] 2202832
> > >   $ pasta --config-net -- sh -c '(sleep 30; echo a) | socat STDIN TCP:88.198.0.161:1111'  
> > 
> > I assume 88.198.0.161 is the gateway address here?
> 
> Yes, see next paragraph of the commit message.
> 
> > >   accept(5, {sa_family=AF_INET, sin_port=htons(57200), sin_addr=inet_addr("127.0.0.1")}, [16]) = 6
> > >   shutdown(6, SHUT_RDWR)                  = 0
> > >   --- SIGTTOU {si_signo=SIGTTOU, si_code=SI_KERNEL} ---
> > >   --- stopped by SIGTTOU ---
> > >   2023/03/23 16:05:06 socat[3] E write(5, 0x5645dbca9000, 2): Connection reset by peer  
> > 
> > That reproduces a problem, but not exactly the one I'm seeing (see
> > notes above).  Mine can be reproduced with:
> > 
> > $ strace -e accept,shutdown socat -u TCP-LISTEN:1111 OPEN:/dev/null,wronly &
> > $ ./pasta --config-net -- socat -u OPEN:/dev/zero,rdonly TCP:192.168.17.1:1111
> > 
> > then waiting 30s.
> 
> Okay, great, I just tried, this is fixed as well.
> 
> > > where the socat client connects, and no data exchange happens for 30s
> > > in either direction. Here, I'm using the default gateway address to
> > > reach the socat server on the host.
> > > 
> > > Fix this by clearing and setting the ACK_FROM_TAP_DUE flag regardless
> > > of reported forward progress. If we clear it when it's already unset,
> > > conn_flag() will do nothing with it. If it was set, it's always fine
> > > to reschedule the timer (as long as we're waiting for a further ACK),
> > > because we just received an ACK segment, no matter its sequence.  
> > 
> > Hrm... is that actually true?  Consider this situation: the server
> > (socket side) sent some data that got lost, so the client (tap side)
> > is not ever going to ack it.  However, the client is continuing to
> > send data, so we keep getting acks from it that don't make forward
> > progress.  Won't the change as proposed mean we then keep delaying the
> > retransmit indefinitely?
> 
> Ah, yes, right, I wanted to keep it simple but in this case we can't
> reschedule, I'm sending a v2 now.
> 
> > I've been thinking about this a bunch, and it's doing my head in a
> > bit.  However, I think the problem is that we have ACK_FROM_TAP_DUE
> > set in the first place, when we don't actually have any outstanding
> > sent data to ack.
> 
> That's not just for data, it's for any ACK, and timer_handler() also
> uses that flag to cover handshake timeouts.
> 
> > I think that's happening because we're not clearing
> > it on the very first ACK from the client  - the one in the handshake.
> 
> Right, and with this change we'll clear that.

We will, but with the side effect noted above.

> > That's because of the + 1 in:
> > 	tcp_seq_init(c, conn, now);
> > 	conn->seq_ack_from_tap = conn->seq_to_tap + 1;
> > 
> > I think we have a subtle conflict of two reasonable seeming invariants
> > here (written for the client on tap side case below, but there are
> > variants for other cases, I think).
> > 
> > A) We expect an ack iff seq_to_tap > seq_ack_from_tap
> > 
> >    According to this invariant, we want to remove the + 1 there.  We
> >    advance seq_to_tap when we send the syn-ack, and ACK_FROM_TAP_DUE
> >    is set.  seq_ack_from_tap catches up when we receive the handshake
> >    ack and ACK_FROM_TAP_DUE is cleared.
> > 
> > B) (seq_to_tap - seq_ack_from_tap) is equal to the number of bytes in
> >    the socket buffer which have been sent to the client at least once
> > 
> >    This wants the + 1, because before the server sends data there's
> >    obviously nothing in the socket buffers, including during the
> >    handshake.
> > 
> > But... I think the only place that relies on (B) is
> > tcp_data_from_sock(), and I don't think it ever gets called with
> > !ESTABLISHED.
> 
> No, it doesn't.
> 
> > So.. I think we want to stick with invariant (A) and
> > remove the "+ 1", for both the conn_from_sock and conn_from_tap
> > variants.
> 
> In both places, the "+ 1" represents the SYN segment, which "counts as
> one". RFC 793, section 3.1:
> 
>   If SYN is present the sequence number is the
>   initial sequence number (ISN) and the first data octet is ISN+1.

I realize it represents the SYN, but that's not the point.
seq_ack_from_tap isn't the ack we *expect*, it's the one we've already
gotten.  By putting the + 1 here, we're treating the client as having
acked the syn (or syn-ack) before it actually did.

> However, while this was mentally convenient for me, I see how it
> violates your expectation at A), so both "+ 1" could be replaced by
> comments, really.

> Right now I just want to fix this regression fast in a non-invasive
> way (in the real world, it probably only happens with "long" iperf3
> tests, but that's, well, a common use case for passt). But then feel
> free to patch that to fit A) above.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-03-24  9:21 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-23 16:08 [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer Stefano Brivio
2023-03-24  5:18 ` David Gibson
2023-03-24  7:42   ` Stefano Brivio
2023-03-24  9:20     ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).