From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 875EB5A024D for ; Fri, 24 Mar 2023 10:21:20 +0100 (CET) Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4PjcCF6hw4z4xDj; Fri, 24 Mar 2023 20:21:13 +1100 (AEDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=201602; t=1679649673; bh=gzanp1GZ+uufh0qGsDTMzxg/pTMl4X3noFursOQD14A=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=VM56HAU0xEgXu4YeCosHvflxXl0r9TCwcLQH/4DQDaJLBSEqWVkf+64JCwJ0WaHRZ 1FmYYyvwnsL0rjZ0ax4rxkbtiKb2ocbC0F6X8g/0MElb0nKROJqKfXjN9+rtWp3SS2 mHpO9lNDSWamTLqGRrPUnwtyE6aiILYrW/yWLD/g= Date: Fri, 24 Mar 2023 20:20:24 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH] tcp: Clear and set ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer Message-ID: References: <20230323160831.2206975-1-sbrivio@redhat.com> <20230324084259.431f9245@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="3Mj5xfUPgMA2XagU" Content-Disposition: inline In-Reply-To: <20230324084259.431f9245@elisabeth> Message-ID-Hash: RMTWP75D4MOJBIW6KUUFPIZZSAXYCT4C X-Message-ID-Hash: RMTWP75D4MOJBIW6KUUFPIZZSAXYCT4C X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --3Mj5xfUPgMA2XagU Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Mar 24, 2023 at 08:42:59AM +0100, Stefano Brivio wrote: > On Fri, 24 Mar 2023 16:18:11 +1100 > David Gibson wrote: >=20 > > On Thu, Mar 23, 2023 at 05:08:31PM +0100, Stefano Brivio wrote: > >=20 > > I'm pretty sure this will make things better than they were, so in > > that sense: > >=20 > > Reviewed-by: David Gibson > >=20 > > However, I'm not entirely convinced by some of the reasoning below. > >=20 > > > Since commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as > > > needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we > > > process an ACK segment, but, more correctly, only if we're really not > > > waiting for a further ACK segment, that is, only if the acknowledged > > > sequence matches what we sent. > > >=20 > > > In the new function implementing this, tcp_update_seqack_from_tap(), > > > we also reset the retransmission counter and store the updated ACK > > > sequence. Both should be done iff forward progress is acknowledged, > > > implied by the fact that the new ACK sequence is greater than the > > > one we previously stored. > > >=20 > > > At that point, it looked natural to also include the statements that > > > clear and set the ACK_FROM_TAP_DUE flag inside the same conditional > > > block: if we're not making forward progress, the need for an ACK, or > > > lack thereof, should remain unchanged. > > >=20 > > > There's a case where this isn't true, though: if a client initiates = =20 > >=20 > > Maybe mention this is a tap-side client specifically. > >=20 > > > a connection, and the server doesn't send data, with the client also > > > not sending any further data except for what's possibly sent along > > > with the first ACK segment following SYN, ACK from the server, we'll = =20 > >=20 > > This doesn't seem right. In my case the client *is* immediately > > sending data. It's *not* sending any in the ACK from the handshake > > (is that even allowed?). >=20 > Yes, and I see that sometimes with iperf3: Huh, interesting. I'm not that astonished that it's allowed. I'm moderately suprised something's doing it in practice, I'm *really* surprised it's not doing so non-deterministically. > $ tshark -r iperf_01.pcap=20 > 1 0.000000 10.131.1.134 =E2=86=92 10.128.2.170 58 TCP 52378 =E2=86= =92 5201 [SYN] Seq=3D0 Win=3D14600 Len=3D0 MSS=3D1398 > 2 0.000064 10.128.2.170 =E2=86=92 10.131.1.134 58 TCP 5201 =E2=86= =92 52378 [SYN, ACK] Seq=3D0 Ack=3D1 Win=3D43690 Len=3D0 MSS=3D65480 > 3 0.000152 10.131.1.134 =E2=86=92 10.128.2.170 91 TCP 52378 =E2=86= =92 5201 [ACK] Seq=3D1 Ack=3D1 Win=3D14600 Len=3D37 >=20 > the client could even send data with the SYN segment, but in that case it > shouldn't be delivered right away to the application, see RFC 793, > section 3.4: >=20 > Several examples of connection initiation follow. Although these > examples do not show connection synchronization using data-carrying > segments, this is perfectly legitimate, so long as the receiving TCP > doesn't deliver the data to the user until it is clear the data is > valid (i.e., the data must be buffered at the receiver until the > connection reaches the ESTABLISHED state). >=20 > This never happens in practice, and as far as I can tell the kernel > doesn't support this (see tcp_rcv_state_process() and > tcp_conn_request(), they won't buffer that data), so passt will also > discard this data, which needs to be re-sent (we only run on Linux, > and we maintain the same behaviour as if passt weren't there). Right, I saw that in the RFC, but it was pretty sure it was a thing that nobody's really done for a long time. Those old RFCs do sometimes have stuff that seemed like a neat idea at the time, but turned out not to be wise. I did wonder if there might be a later RFC banning or at least deprecating that practice. > But data on the third segment can actually happen, and arguably we > could say that the connection is ESTABLISHED at that point (there's > no need for well-defined sequence points or suchlike). Right, I was pretty sure that data on the SYN packets wasn't a thing in practice, but I wasn't sure about the third packet. > In that sense, the (p->count =3D=3D 1) check in tcp_tap_handler() on > TAP_SYN_RCVD isn't quite correct, and the client would need to > retransmit the data with the current implementation, which is not > ideal and could be improved. What's critical is that we don't reset > the connection. >=20 > > AFAICT it's just the fact that the (socket side) server doesn't send > > data which is relevant. >=20 > Maybe yes, I just couldn't reproduce this reliably with iperf3, and > the difference between working and failing cases seemed to be that. On > the other hand, your reproducer below shows this is not the case, so > fine, this is irrelevant. >=20 > > > never, in the established state of the connection, call > > > tcp_update_seqack_from_tap() with reported forward progress. > > >=20 > > > In that case, we'll never reset the initial ACK_FROM_TAP_DUE (used > > > to trigger handshake timeouts), interpret the trigger as a the need > > > for a retransmission (which won't actually retransmit anything), and > > > eventually time out a perfectly healthy connection once we reach the > > > maximum retransmission count. > > >=20 > > > This is relatively simple to reproduce if we switch back to 30s > > > iperf3 test runs, but it depends on the timing of the iperf3 client: > > > sometimes the first ACK from the client (part of the three-way > > > handshake) will come with data (and we'll hit the problem), sometimes > > > data will be sent later (and we call to tcp_update_seqack_from_tap() > > > from tcp_data_from_tap() later, avoiding the issue). =20 > >=20 > > This last bit seems wrong too. What I'm seeing is that > > tcp_update_seqack_from_tap() *is* called later from > > tcp_data_from_tap(), but it doesn't avoid the issue, because the > > from-client ack number hasn't advanced, so it doesn't do anything. >=20 > Right, same as above. >=20 > > > A reliable reproducer is a simpler: > > >=20 > > > $ strace -e accept,shutdown socat TCP-LISTEN:1111 STDIO & > > > [2] 2202832 > > > $ pasta --config-net -- sh -c '(sleep 30; echo a) | socat STDIN TCP= :88.198.0.161:1111' =20 > >=20 > > I assume 88.198.0.161 is the gateway address here? >=20 > Yes, see next paragraph of the commit message. >=20 > > > accept(5, {sa_family=3DAF_INET, sin_port=3Dhtons(57200), sin_addr= =3Dinet_addr("127.0.0.1")}, [16]) =3D 6 > > > shutdown(6, SHUT_RDWR) =3D 0 > > > --- SIGTTOU {si_signo=3DSIGTTOU, si_code=3DSI_KERNEL} --- > > > --- stopped by SIGTTOU --- > > > 2023/03/23 16:05:06 socat[3] E write(5, 0x5645dbca9000, 2): Connect= ion reset by peer =20 > >=20 > > That reproduces a problem, but not exactly the one I'm seeing (see > > notes above). Mine can be reproduced with: > >=20 > > $ strace -e accept,shutdown socat -u TCP-LISTEN:1111 OPEN:/dev/null,wro= nly & > > $ ./pasta --config-net -- socat -u OPEN:/dev/zero,rdonly TCP:192.168.17= =2E1:1111 > >=20 > > then waiting 30s. >=20 > Okay, great, I just tried, this is fixed as well. >=20 > > > where the socat client connects, and no data exchange happens for 30s > > > in either direction. Here, I'm using the default gateway address to > > > reach the socat server on the host. > > >=20 > > > Fix this by clearing and setting the ACK_FROM_TAP_DUE flag regardless > > > of reported forward progress. If we clear it when it's already unset, > > > conn_flag() will do nothing with it. If it was set, it's always fine > > > to reschedule the timer (as long as we're waiting for a further ACK), > > > because we just received an ACK segment, no matter its sequence. =20 > >=20 > > Hrm... is that actually true? Consider this situation: the server > > (socket side) sent some data that got lost, so the client (tap side) > > is not ever going to ack it. However, the client is continuing to > > send data, so we keep getting acks from it that don't make forward > > progress. Won't the change as proposed mean we then keep delaying the > > retransmit indefinitely? >=20 > Ah, yes, right, I wanted to keep it simple but in this case we can't > reschedule, I'm sending a v2 now. >=20 > > I've been thinking about this a bunch, and it's doing my head in a > > bit. However, I think the problem is that we have ACK_FROM_TAP_DUE > > set in the first place, when we don't actually have any outstanding > > sent data to ack. >=20 > That's not just for data, it's for any ACK, and timer_handler() also > uses that flag to cover handshake timeouts. >=20 > > I think that's happening because we're not clearing > > it on the very first ACK from the client - the one in the handshake. >=20 > Right, and with this change we'll clear that. We will, but with the side effect noted above. > > That's because of the + 1 in: > > tcp_seq_init(c, conn, now); > > conn->seq_ack_from_tap =3D conn->seq_to_tap + 1; > >=20 > > I think we have a subtle conflict of two reasonable seeming invariants > > here (written for the client on tap side case below, but there are > > variants for other cases, I think). > >=20 > > A) We expect an ack iff seq_to_tap > seq_ack_from_tap > >=20 > > According to this invariant, we want to remove the + 1 there. We > > advance seq_to_tap when we send the syn-ack, and ACK_FROM_TAP_DUE > > is set. seq_ack_from_tap catches up when we receive the handshake > > ack and ACK_FROM_TAP_DUE is cleared. > >=20 > > B) (seq_to_tap - seq_ack_from_tap) is equal to the number of bytes in > > the socket buffer which have been sent to the client at least once > >=20 > > This wants the + 1, because before the server sends data there's > > obviously nothing in the socket buffers, including during the > > handshake. > >=20 > > But... I think the only place that relies on (B) is > > tcp_data_from_sock(), and I don't think it ever gets called with > > !ESTABLISHED. >=20 > No, it doesn't. >=20 > > So.. I think we want to stick with invariant (A) and > > remove the "+ 1", for both the conn_from_sock and conn_from_tap > > variants. >=20 > In both places, the "+ 1" represents the SYN segment, which "counts as > one". RFC 793, section 3.1: >=20 > If SYN is present the sequence number is the > initial sequence number (ISN) and the first data octet is ISN+1. I realize it represents the SYN, but that's not the point. seq_ack_from_tap isn't the ack we *expect*, it's the one we've already gotten. By putting the + 1 here, we're treating the client as having acked the syn (or syn-ack) before it actually did. > However, while this was mentally convenient for me, I see how it > violates your expectation at A), so both "+ 1" could be replaced by > comments, really. > Right now I just want to fix this regression fast in a non-invasive > way (in the real world, it probably only happens with "long" iperf3 > tests, but that's, well, a common use case for passt). But then feel > free to patch that to fit A) above. >=20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --3Mj5xfUPgMA2XagU Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmQda0MACgkQzQJF27ox 2GeYBg//YxLQW8zCxsJftamV7WU6KrqIzR7GO6j/dA6sLTlDmIqmnjE+Mx/ijjD+ Wif2gnU0E5ZK3KmD7aq9iF1+G1BCFKp6T5A7qERHBCK1TFBIWwU09mMne/8N72cX SqYyfDWYGwa2TrkzPodOgVvaJWgbpKEuSi/NFxSJ/hODpRJxS+IiP8IEu+Xe3S2K RanerMWRyLc4iKEVTLShuCKKLZO0tOe1Vm1k3CA08uouKTCrp2ze1SMkB5yY+acS m8rMQ5Ot/c5m/ABD4wuFjB16gPQpCCJkl/2v52TwdyIJdzPR7ugs/LHvTR12wF6W POGt/vOXTnd8epazWDKBDtrKOfcyCbnQ+uaePGXMc8RDkuBJU/uIabqhUGM1lvIs 1AXuUUEF3ymKLeUlcAFjSjgwZj8va46fqatU8WSiG2rWdrMGsaK5waKUxp1MHil4 294HLRxReeFmLPMYo2+V/MHsswmn9xe0vL+NGnKCb7Ua4u8ixv0mM2M0ArgFlDUK WNputbJM4LKqDuCqBWZB1JEjVlFXlb1aa/zXMCB7jdcb1nbhql5/wDjdSjbs84nS o9ScWI1eH/RwsKg6Y812tv6SdsyySPYfwTbIKs7ZjmTi21bSR3/DP8y3uGuvWoqy 2d8pvXfZ2Gu4hLEZmWSNSXci5fDCngzzkT1/D5HdMH1jWzkZdE0= =MFFM -----END PGP SIGNATURE----- --3Mj5xfUPgMA2XagU--