From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id B87C95A0276 for ; Mon, 25 Sep 2023 07:52:19 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=201602; t=1695621133; bh=YEJrDUUFSrKvZi/4fo65HfK7n7zVF+l7owlmJ4meRk8=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=mJr1bQSU4r8PxZdq2kwZyDqd5u83AkM+YvLlCBTHkQmL0gGikzY9I2FyIIuK+yruU fhhK6tFtXrVnK2rFh+kp+8zr9zgFaim+Ko32TQ+HlTZndqe/wu0KQdqjXxlXOeDpDL I+nv0gZP77rJxLR+QCThDjaqqM8K2XX1wnyY1pcM= Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4RvBpj2LlVz4xM7; Mon, 25 Sep 2023 15:52:13 +1000 (AEST) Date: Mon, 25 Sep 2023 14:47:52 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH RFT 4/5] tcp, tap: Don't increase tap-side sequence counter for dropped frames Message-ID: References: <20230922220610.58767-1-sbrivio@redhat.com> <20230922220610.58767-5-sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="hm2KEGimLKyj4L1N" Content-Disposition: inline In-Reply-To: <20230922220610.58767-5-sbrivio@redhat.com> Message-ID-Hash: 2AH5FHFAE3ELRBWKMZJBCPXYJGH5XR7M X-Message-ID-Hash: 2AH5FHFAE3ELRBWKMZJBCPXYJGH5XR7M X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Matej Hrica , passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --hm2KEGimLKyj4L1N Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Sep 23, 2023 at 12:06:09AM +0200, Stefano Brivio wrote: > ...so that we'll retry sending them, instead of more-or-less silently > dropping them. This happens quite frequently if our sending buffer on > the UNIX domain socket is heavily constrained (for instance, by the > 208 KiB default memory limit). >=20 > It might be argued that dropping frames is part of the expected TCP > flow: we don't dequeue those from the socket anyway, so we'll > eventually retransmit them. >=20 > But we don't need the receiver to tell us (by the way of duplicate or > missing ACKs) that we couldn't send them: we already know as > sendmsg() reports that. This seems to considerably increase > throughput stability and throughput itself for TCP connections with > default wmem_max values. >=20 > Unfortunately, the 16 bits left as padding in the frame descriptors I assume you're referring to the 'pad' fields in tcp[46]_l2_buf_t, yes? For AVX2 we have substantially more space here. Couldn't we put a conn (or seq) pointer in here at the cost of a few bytes MSS for non-AVX2 and zero cost for AVX2 (which is probably the majority case)? > we use internally aren't enough to uniquely identify for which > connection we should update sequence numbers: create a parallel > array of pointers to sequence numbers and L4 lengths, of > TCP_FRAMES_MEM size, and go through it after calling sendmsg(). >=20 > Signed-off-by: Stefano Brivio > --- > tap.c | 10 +++++++--- > tap.h | 2 +- > tcp.c | 43 ++++++++++++++++++++++++++++++++++++------- > 3 files changed, 44 insertions(+), 11 deletions(-) >=20 > diff --git a/tap.c b/tap.c > index 93db989..b30ff81 100644 > --- a/tap.c > +++ b/tap.c > @@ -413,13 +413,15 @@ static size_t tap_send_frames_passt(const struct ct= x *c, > * @c: Execution context > * @iov: Array of buffers, each containing one frame (with L2 headers) > * @n: Number of buffers/frames in @iov > + * > + * Return: number of frames actually sent > */ > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n) > { > size_t m; > =20 > if (!n) > - return; > + return 0; > =20 > if (c->mode =3D=3D MODE_PASST) > m =3D tap_send_frames_passt(c, iov, n); > @@ -427,9 +429,11 @@ void tap_send_frames(struct ctx *c, const struct iov= ec *iov, size_t n) > m =3D tap_send_frames_pasta(c, iov, n); > =20 > if (m < n) > - debug("tap: dropped %lu frames of %lu due to short send", n - m, n); > + debug("tap: failed to send %lu frames of %lu", n - m, n); > =20 > pcap_multiple(iov, m, c->mode =3D=3D MODE_PASST ? sizeof(uint32_t) : 0); > + > + return m; > } > =20 > /** > diff --git a/tap.h b/tap.h > index 021fb7c..952fafc 100644 > --- a/tap.h > +++ b/tap.h > @@ -73,7 +73,7 @@ void tap_icmp6_send(const struct ctx *c, > const struct in6_addr *src, const struct in6_addr *dst, > void *in, size_t len); > int tap_send(const struct ctx *c, const void *data, size_t len); > -void tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); > +size_t tap_send_frames(struct ctx *c, const struct iovec *iov, size_t n); > void tap_update_mac(struct tap_hdr *taph, > const unsigned char *eth_d, const unsigned char *eth_s); > void tap_listen_handler(struct ctx *c, uint32_t events); > diff --git a/tcp.c b/tcp.c > index 4606f17..76b7b8d 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -434,6 +434,16 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS]; > */ > static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; > =20 > +/** > + * tcp_buf_seq_update - Sequences to update with length of frames once s= ent > + * @seq: Pointer to sequence number sent to tap-side, to be updated > + * @len: TCP payload length > + */ > +struct tcp_buf_seq_update { > + uint32_t *seq; > + uint16_t len; > +}; > + > /* Static buffers */ > =20 > /** > @@ -462,6 +472,8 @@ static struct tcp4_l2_buf_t { > #endif > tcp4_l2_buf[TCP_FRAMES_MEM]; > =20 > +static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; > + > static unsigned int tcp4_l2_buf_used; > =20 > /** > @@ -490,6 +502,8 @@ struct tcp6_l2_buf_t { > #endif > tcp6_l2_buf[TCP_FRAMES_MEM]; > =20 > +static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; > + > static unsigned int tcp6_l2_buf_used; > =20 > /* recvmsg()/sendmsg() data for tap */ > @@ -1369,10 +1383,17 @@ static void tcp_l2_flags_buf_flush(struct ctx *c) > */ > static void tcp_l2_data_buf_flush(struct ctx *c) > { > - tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); > + unsigned i; > + size_t m; > + > + m =3D tap_send_frames(c, tcp6_l2_iov, tcp6_l2_buf_used); > + for (i =3D 0; i < m; i++) > + *tcp6_l2_buf_seq_update[i].seq +=3D tcp6_l2_buf_seq_update[i].len; > tcp6_l2_buf_used =3D 0; > =20 > - tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); > + m =3D tap_send_frames(c, tcp4_l2_iov, tcp4_l2_buf_used); > + for (i =3D 0; i < m; i++) > + *tcp4_l2_buf_seq_update[i].seq +=3D tcp4_l2_buf_seq_update[i].len; > tcp4_l2_buf_used =3D 0; > } > =20 > @@ -2149,10 +2170,11 @@ static int tcp_sock_consume(struct tcp_tap_conn *= conn, uint32_t ack_seq) > * @plen: Payload length at L4 > * @no_csum: Don't compute IPv4 checksum, use the one from previous buff= er > * @seq: Sequence number to be sent > - * @now: Current timestamp > + * @seq_update: Pointer to sequence number to update on successful send > */ > static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn, > - ssize_t plen, int no_csum, uint32_t seq) > + ssize_t plen, int no_csum, uint32_t seq, > + uint32_t *seq_update) seq_update is always &conn->seq_to_tap, so there's no need for an additional parameter. > { > struct iovec *iov; > =20 > @@ -2160,6 +2182,9 @@ static void tcp_data_to_tap(struct ctx *c, struct t= cp_tap_conn *conn, > struct tcp4_l2_buf_t *b =3D &tcp4_l2_buf[tcp4_l2_buf_used]; > uint16_t *check =3D no_csum ? &(b - 1)->iph.check : NULL; > =20 > + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].seq =3D seq_update; > + tcp4_l2_buf_seq_update[tcp4_l2_buf_used].len =3D plen; > + > iov =3D tcp4_l2_iov + tcp4_l2_buf_used++; > iov->iov_len =3D tcp_l2_buf_fill_headers(c, conn, b, plen, > check, seq); > @@ -2168,6 +2193,9 @@ static void tcp_data_to_tap(struct ctx *c, struct t= cp_tap_conn *conn, > } else if (CONN_V6(conn)) { > struct tcp6_l2_buf_t *b =3D &tcp6_l2_buf[tcp6_l2_buf_used]; > =20 > + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].seq =3D seq_update; > + tcp6_l2_buf_seq_update[tcp6_l2_buf_used].len =3D plen; > + > iov =3D tcp6_l2_iov + tcp6_l2_buf_used++; > iov->iov_len =3D tcp_l2_buf_fill_headers(c, conn, b, plen, > NULL, seq); > @@ -2193,7 +2221,7 @@ static int tcp_data_from_sock(struct ctx *c, struct= tcp_tap_conn *conn) > int s =3D conn->sock, i, ret =3D 0; > struct msghdr mh_sock =3D { 0 }; > uint16_t mss =3D MSS_GET(conn); > - uint32_t already_sent; > + uint32_t already_sent, seq; > struct iovec *iov; > =20 > already_sent =3D conn->seq_to_tap - conn->seq_ack_from_tap; > @@ -2282,14 +2310,15 @@ static int tcp_data_from_sock(struct ctx *c, stru= ct tcp_tap_conn *conn) > =20 > /* Finally, queue to tap */ > plen =3D mss; > + seq =3D conn->seq_to_tap; This will only be correct if tcp_l2_data_buf_flush() is *always* called between tcp_data_from_sock() calls for the same socket. That should be true for the normal course of things. However, couldn't it happen that we get a normal socket EPOLLIN event for a particular connection - calling tcp_data_from_sock() - but in the same epoll() round we also get a tap ack for the same connection which causes another call to tcp_data_from_sock() (with the change from patch 2/5). IIRC those would both happen before the deferred handling and therefore the data_buf_flush(). Not sure how to deal with that short of separate 'seq_queued' and 'seq_sent' counters in the connection structure, which is a bit unfortunate. > for (i =3D 0; i < send_bufs; i++) { > int no_csum =3D i && i !=3D send_bufs - 1 && tcp4_l2_buf_used; > =20 > if (i =3D=3D send_bufs - 1) > plen =3D last_len; > =20 > - tcp_data_to_tap(c, conn, plen, no_csum, conn->seq_to_tap); > - conn->seq_to_tap +=3D plen; > + tcp_data_to_tap(c, conn, plen, no_csum, seq, &conn->seq_to_tap); > + seq +=3D plen; > } > =20 > conn_flag(c, conn, ACK_FROM_TAP_DUE); --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --hm2KEGimLKyj4L1N Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmUREPEACgkQzQJF27ox 2GfPexAAqLD99lEKriK+02iGHA/ruo5pa1Bp9ZFNCVW8MOnIkejOcYH6GmpYwk9D CaFWSoOs9XBU0AKXxd5a3I+RbogutFuEJiXQRHvtSw0k8YzskLAVJalZbfDUf/cc a56vnPcXzmYBQZEN6E+8aipGEZUJrwlsIUzJYVASNAi4U6eCkjVTpyebfDLx4EV2 YGYVwNV21giVa7T9LxY9xl3AyxyvDT/PP7v3KxAMCPxhCGwR93lJakE5f6Dxdmqj 4oa4Fl8YlPiVguWzdjlMW99VO68J4KkMZ84s1OX8CrJrzdkxeTmUAFlujndHwVu8 EQZReGi4lewgPJTZ6Fa2k/ve46DMw3VXcgCfxY/WwhYxT48BGVEqkzvFzA+fQ4BV HDGo+7djCp6I/gXgCqJfOlsIw2rJW/5phzGiFrKIsZYc/Rc8j/93Z6M996I7v9KC wTm7xlW911fDZBeDduq99z62p+c+bB8d89XOIS3jap0PNieV3/Dl+XtJM36k6bYE QrFBrBy9ROnIitLY2h3wsaEiMoyc/NN/3uR7lqLD3od0r1Y/IuldWpSkK7ZGqCGn g//UjK6RDI6Ke9/W/3my9C69AWXoEKBP0EPXAwL9zT+7FLrXnjSJcZy5gnxBz+pV Li4xL0NVaOZAEZcnrYkkg5VdQPQfoQiAvKYELojJlOq5mBqPwKg= =Mcvg -----END PGP SIGNATURE----- --hm2KEGimLKyj4L1N--