From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 2A63C5A0272 for ; Thu, 28 Sep 2023 04:03:43 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=201602; t=1695866617; bh=GAfyv+KU6/ro4RGk7XV6mwx+aHAyzeObywnetG0C1HI=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=H8yQbOyE2zG2sdPJL7vNF3VpzDGEkuqzVWuhRccE/Fu7vANt4xXs5KGRVEceQC7e+ 9eJGP58p7kLwNlCU70/1RcpqJKcmWjudgVigZVoZiROTQXM5FJQzT99KEiMrXo4N+a rijlmeHZ7MFMYgjzf5gEKEP/t96/4PchiwMzPFz4= Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4RwxbY4m8kz4xQ4; Thu, 28 Sep 2023 12:03:37 +1000 (AEST) Date: Thu, 28 Sep 2023 11:58:45 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames Message-ID: References: <20230922220610.58767-1-sbrivio@redhat.com> <20230922220610.58767-5-sbrivio@redhat.com> <20230927190603.10a1ed74@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="e/x2YrKwGqmBUtL1" Content-Disposition: inline In-Reply-To: <20230927190603.10a1ed74@elisabeth> Message-ID-Hash: DOVP4S5ZV5XEA3DMJLWDAUOEDORYE7S7 X-Message-ID-Hash: DOVP4S5ZV5XEA3DMJLWDAUOEDORYE7S7 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Matej Hrica , passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --e/x2YrKwGqmBUtL1 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Sep 27, 2023 at 07:06:03PM +0200, Stefano Brivio wrote: > On Mon, 25 Sep 2023 14:47:52 +1000 > David Gibson wrote: >=20 > > On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote: > > > ...so that we'll retry sending them, instead of more-or-less silently > > > dropping them. This happens quite frequently if our sending buffer on > > > the UNIX domain socket is heavily constrained (for instance, by the > > > 208 KiB default memory limit). > > >=20 > > > It might be argued that dropping frames is part of the expected TCP > > > flow: we don't dequeue those from the socket anyway, so we'll > > > eventually retransmit them. > > >=20 > > > But we don't need the receiver to tell us (by the way of duplicate or > > > missing ACKs) that we couldn't send them: we already know as > > > sendmsg() reports that. This seems to considerably increase > > > throughput stability and throughput itself for TCP connections with > > > default wmem_max values. > > >=20 > > > Unfortunately, the 16 bits left as padding in the frame descriptors = =20 > >=20 > > I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, > > yes? >=20 > Right, that. >=20 > > For AVX2 we have substantially more space here. Couldn't we put > > a conn (or seq) pointer in here at the cost of a few bytes MSS for > > non-AVX2 and zero cost for AVX2 (which is probably the majority case)? >=20 > Yes, true. On the other hand, having this parallel array only affects > readability I guess, whereas inserting pointers and lengths in > tcp[46]_l2_buf_t actually decreases the usable MSS (not just on > non-AVX2 x86, but also on other architectures). So I'd rather stick to > this. Yeah, I guess so. Actually.. I did just think of one other option. It avoids both any extra padding and a parallel array, but at the cost of additional work when frames are dropped. We could use that 16-bits of padding to store the TCP payload length. Then when we don't manage to send all our frames, we do another loop through and add up how many stream bytes we actually sent to update the seq pointer. > > > we use internally aren't enough to uniquely identify for which > > > connection we should update sequence numbers: create a parallel > > > array of pointers to sequence numbers and L4 lengths, of > > > TCP_FRAMES_MEM size, and go through it after calling sendmsg(). > > >=20 > > > Signed-off-by: Stefano Brivio > > > --- > > > tap.c | 10 +++++++--- > > > tap.h | 2 +- > > > tcp.c | 43 ++++++++++++++++++++++++++++++++++++------- > > > 3 files changed, 44 insertions(+), 11 deletions(-) > > >=20 > > > diff --git a/tap.c b/tap.c > > > index 93db989..b30ff81 100644 > > > --- a/tap.c > > > +++ b/tap.c > > > @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struc= t ctx *c, > > > * @c: Execution context > > > * @iov: Array of buffers, each containing one frame (with L2 header= s) > > > * @n: Number of buffers/frames in @iov > > > + * > > > + * Return: number of frames actually sent > > > */ > > > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t = n) > > > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_= t n) > > > { > > > size_t m; > > > =20 > > > if (!n) > > > - return; > > > + return 0; > > > =20 > > > if (c->mode =3D=3D MODE_PASST) > > > m =3D tap_send_frames_passt(c, iov, n); > > > @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct= iovec *iov, size_t n) > > > m =3D tap_send_frames_pasta(c, iov, n); > > > =20 > > > if (m < n) > > > - debug("tap: dropped %lu frames of %lu due to short send", n - m, n= ); > > > + debug("tap: failed to send %lu frames of %lu", n - m, n); > > > =20 > > > pcap_multiple(iov, m, c->mode =3D=3D MODE_PASST ? sizeof(uint32_t) = : 0); > > > + > > > + return m; > > > } > > > =20 > > > /** > > > diff --git a/tap.h b/tap.h > > > index 021fb7c..952fafc 100644 > > > --- a/tap.h > > > +++ b/tap.h > > > @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c, > > > const struct in6_addr *src, const struct in6_addr *dst, > > > void *in, size_t len); > > > int tap_send(const struct ctx *c, const void *data, size_t len); > > > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t = n); > > > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_= t n); > > > void tap_update_mac(struct tap_hdr *taph, > > > const unsigned char *eth_d, const unsigned char *eth_s); > > > void tap_listen_handler(struct ctx *c, uint32_t events); > > > diff --git a/tcp.c b/tcp.c > > > index 4606f17..76b7b8d 100644 > > > --- a/tcp.c > > > +++ b/tcp.c > > > @@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS]; > > > */ > > > static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; > > > =20 > > > +/** > > > + * tcp_buf_seq_update - Sequences to update with length of frames on= ce sent > > > + * @seq: Pointer to sequence number sent to tap-side, to be updated > > > + * @len: TCP payload length > > > + */ > > > +struct tcp_buf_seq_update { > > > + uint32_t *seq; > > > + uint16_t len; > > > +}; > > > + > > > /* Static buffers */ > > > =20 > > > /** > > > @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t { > > > #endif > > > tcp4_l2_buf[TCP_FRAMES_MEM]; > > > =20 > > > +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_M= EM]; > > > + > > > static unsigned int tcp4_l2_buf_used; > > > =20 > > > /** > > > @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t { > > > #endif > > > tcp6_l2_buf[TCP_FRAMES_MEM]; > > > =20 > > > +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_M= EM]; > > > + > > > static unsigned int tcp6_l2_buf_used; > > > =20 > > > /* recvmsg()/sendmsg() data for tap */ > > > @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx= *c) > > > */ > > > static void tcp_l2_data_buf_flush(struct ctx *c) > > > { > > > - tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); > > > + unsigned i; > > > + size_t m; > > > + > > > + m =3D tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); > > > + for (i =3D 0; i < m; i++) > > > + *tcp6_l2_buf_seq_update[i].seq +=3D tcp6_l2_buf_seq_update[i].len; > > > tcp6_l2_buf_used =3D 0; > > > =20 > > > - tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); > > > + m =3D tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); > > > + for (i =3D 0; i < m; i++) > > > + *tcp4_l2_buf_seq_update[i].seq +=3D tcp4_l2_buf_seq_update[i].len; > > > tcp4_l2_buf_used =3D 0; > > > } > > > =20 > > > @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_co= nn *conn, uint32_t ack_seq) > > > * @plen: Payload length at L4 > > > * @no_csum: Don't compute IPv4 checksum, use the one from previous = buffer > > > * @seq: Sequence number to be sent > > > - * @now: Current timestamp > > > + * @seq_update: Pointer to sequence number to update on successful s= end > > > */ > > > static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, > > > - ssize_t plen, int no_csum, uint32_t seq) > > > + ssize_t plen, int no_csum, uint32_t seq, > > > + uint32_t *seq_update) =20 > >=20 > > seq_update is always &conn->seq_to_tap, so there's no need for an > > additional parameter. >=20 > Oh, right, I'll drop that. >=20 > > > { > > > struct iovec *iov; > > > =20 > > > @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, stru= ct tcp_tap_conn *conn, > > > struct tcp4_l2_buf_t *b =3D &tcp4_l2_buf[tcp4_l2_buf_used]; > > > uint16_t *check =3D no_csum ? &(b - 1)->iph.check : NULL; > > > =20 > > > + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq =3D seq_update; > > > + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len =3D plen; > > > + > > > iov =3D tcp4_l2_iov + tcp4_l2_buf_used++; > > > iov->iov_len =3D tcp_l2_buf_fill_headers(c, conn, b, plen, > > > check, seq); > > > @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, stru= ct tcp_tap_conn *conn, > > > } else if (CONN_V6(conn)) { > > > struct tcp6_l2_buf_t *b =3D &tcp6_l2_buf[tcp6_l2_buf_used]; > > > =20 > > > + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq =3D seq_update; > > > + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len =3D plen; > > > + > > > iov =3D tcp6_l2_iov + tcp6_l2_buf_used++; > > > iov->iov_len =3D tcp_l2_buf_fill_headers(c, conn, b, plen, > > > NULL, seq); > > > @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, st= ruct tcp_tap_conn *conn) > > > int s =3D conn->sock, i, ret =3D 0; > > > struct msghdr mh_sock =3D { 0 }; > > > uint16_t mss =3D MSS_GET(conn); > > > - uint32_t already_sent; > > > + uint32_t already_sent, seq; > > > struct iovec *iov; > > > =20 > > > already_sent =3D conn->seq_to_tap - conn->seq_ack_from_tap; > > > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, = struct tcp_tap_conn *conn) > > > =20 > > > /* Finally, queue to tap */ > > > plen =3D mss; > > > + seq =3D conn->seq_to_tap; =20 > >=20 > > This will only be correct if tcp_l2_data_buf_flush() is *always* > > called between tcp_data_from_sock() calls for the same socket. That > > should be true for the normal course of things. However, couldn't it > > happen that we get a normal socket EPOLLIN event for a particular > > connection - calling tcp_data_from_sock() - but in the same epoll() > > round we also get a tap ack for the same connection which causes > > another call to tcp_data_from_sock() (with the change from patch > > 2/5). IIRC those would both happen before the deferred handling and > > therefore the data_buf_flush(). >=20 > Ah, yes, I actually wrote this before 2/5 and concluded it was okay :/ > but with that change, it's not. Unless we drop that change from 2/5. Even if we drop the change, it's a worryingly subtle constraint. > > Not sure how to deal with that short of separate 'seq_queued' and > > 'seq_sent' counters in the connection structure, which is a bit > > unfortunate. >=20 > I wonder how bad it is if we call tcp_l2_data_buf_flush() > unconditionally before calling tcp_data_from_sock() from > tcp_tap_handler(). But again, maybe this is not needed at all, we > should check that epoll detail from 2/5 first... >=20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --e/x2YrKwGqmBUtL1 Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIyBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmUU3c4ACgkQzQJF27ox 2GenkA/4gz7XP2jr0oaoxzDb3E/XQ56kmQJGfHXcipKt0hrhFKef+J2YEqjTace0 3PLghOQ/HO4vCmGKd9ivIgkOEsZYl2sMD9d0wAW7o/+6FUHbSOIEgGUK/cNkaDnt bQqs6oQjea6wGlXxhz60LTFs4REUnvcm40CxtXEOq6652Ub6t0aMvMMYVYP+O8jX tncA167Ix9yRRITU+EjuAtpX+gI4SLc/wJGlkw2b5nb6tsaYA9PP1qUU3pnKW/nJ e9RB7dwz80IXLc83QGOpN41XUY4Ta68NEeMO7mTJROk8GCMm85dWF124gr3aXFor ioxiZOlyc425j1tk5bhyV7h6qJuai4NsokdgkQ2+aSH6HosDp5ekKVk65PkAwEYB OHqShJ4Jn4JV3pVmyrqLPnk1HBcrGW9eufg6cdL0Z+xRwKp4yVDYSKN5G5Xn7d13 kTbHxw9naeJ5muf/MNVIy83072kizSMOU9Xf/JB52j78KRPZXLsrZgktKJ6zSlDU Az9eKZMHThNwtv/IPe5iUQbyHWwS6tR3qTqNl8QJ4w+7TcbyfPsg7gUw1a1oMzcA gdnKnM36GBQ0tzHH82koijXtwXw6ICQsVfvz9IvuhPDyTUt2GdFgRF0joxkAL0Y0 Hi42yhfTyVNJ2K8ub3OT79IbFSroLzenFcsHyPpEELqql3Wb/Q== =0Bif -----END PGP SIGNATURE----- --e/x2YrKwGqmBUtL1--