From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 557395A0272 for ; Mon, 5 Feb 2024 07:13:48 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1707113625; bh=nGED+fvcSzyK3d7BgJvHCEB5fiAhu7M77uaysEIaiMI=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=C5YgYQ94Sd92sxHYJr9Kj+zTnq0jAl5MEwtypITsL6M1mKxagMARmrsmU0CkIGcVM nLW0Zuqyuw7VZqKH8G4Q1/XtVPssADfJtDsd/tWKwLGMbqHYfwwBOWqQycSfHbfgD9 ipg6g9bG1BVbfDV4VtjdoTS+urVk22RGZVj9yQ3ZjNvkOTEWWbJTcbIFxVXUhxIGWH DznhyOzXT9d0DbiqGWT6+4v+XpQjXfPOIVOAeS/4hLyjfnDXDas6ej59bpbOUAaBgT 3+JdWdFMUDU+mt1/f6bHDb9UfL/Ie9XycG6/+k7pWx2Pc2wjDL5SQyBKyPlGhiobQE nYOU2dsBUzOFg== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4TSx091hRNz4wyb; Mon, 5 Feb 2024 17:13:45 +1100 (AEDT) Date: Mon, 5 Feb 2024 16:42:35 +1100 From: David Gibson To: Jon Maloy Subject: Re: tcp: leverage support of SO_PEEK_OFF socket option when available Message-ID: References: <20240201221211.1242958-1-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="73Iue7GcYywnAtmC" Content-Disposition: inline In-Reply-To: <20240201221211.1242958-1-jmaloy@redhat.com> Message-ID-Hash: MSRNROY6FHAXOT54AWWIFXTG7G3E7B6A X-Message-ID-Hash: MSRNROY6FHAXOT54AWWIFXTG7G3E7B6A X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --73Iue7GcYywnAtmC Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Feb 01, 2024 at 05:12:11PM -0500, Jon Maloy wrote: > The kernel may support recvmsg(MSG_PEEK), starting reading data from a > given offset set by the SO_PEEK_OFF socket option. This makes it > possible to avoid repeated reading of already read initial bytes of a > received message, hence saving read cycles when forwarding TCP messages > in the host->name space direction. >=20 > In this commit, we add functionality to leverage this feature when availa= ble, > while we fall back to the previous behavior when not. >=20 > Measurements with iperf3 shows that throughput increases with 15-20 perce= nt > in the host->namespace direction when this feature is used. >=20 > Signed-off-by: Jon Maloy > --- > tcp.c | 35 +++++++++++++++++++++++++++++++++-- > 1 file changed, 33 insertions(+), 2 deletions(-) >=20 > diff --git a/tcp.c b/tcp.c > index 905d26f..58674eb 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -505,6 +505,7 @@ static struct tcp_buf_seq_update tcp6_l2_buf_seq_upda= te[TCP_FRAMES_MEM]; > static unsigned int tcp6_l2_buf_used; > =20 > /* recvmsg()/sendmsg() data for tap */ > +static bool peek_offset_cap =3D false; > static char tcp_buf_discard [MAX_WINDOW]; > static struct iovec iov_sock [TCP_FRAMES_MEM + 1]; > =20 > @@ -582,6 +583,14 @@ static_assert(ARRAY_SIZE(tc_hash) >=3D FLOW_MAX, > int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; > int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; > =20 A function comnment would be nice. > +static void set_peek_offset(int s, int offset) > +{ > + if (!peek_offset_cap) > + return; > + if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset))) > + perror("Failed to set SO_PEEK_OFF\n"); > +} > + > /** > * tcp_conn_epoll_events() - epoll events mask for given connection state > * @events: Current connection events > @@ -1951,7 +1960,7 @@ static void tcp_conn_from_tap(struct ctx *c, > if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) > goto cancel; > } > - > + set_peek_offset(s, 0); > conn =3D &flow->tcp; > conn->f.type =3D FLOW_TCP; > conn->sock =3D s; > @@ -2174,6 +2183,12 @@ static int tcp_data_from_sock(struct ctx *c, struc= t tcp_tap_conn *conn) > if (iov_rem) > iov_sock[fill_bufs].iov_len =3D iov_rem; > =20 > + /* Don't use discard buffer if SO_PEEK_OFF is supported. */ > + if (peek_offset_cap) { > + mh_sock.msg_iov =3D &iov_sock[1]; > + mh_sock.msg_iovlen -=3D 1; > + } > + > /* Receive into buffers, don't dequeue until acknowledged by guest. */ > do > len =3D recvmsg(s, &mh_sock, MSG_PEEK); > @@ -2195,7 +2210,9 @@ static int tcp_data_from_sock(struct ctx *c, struct= tcp_tap_conn *conn) > return 0; > } > =20 > - sendlen =3D len - already_sent; > + sendlen =3D len; > + if (!peek_offset_cap) > + sendlen -=3D already_sent; > if (sendlen <=3D 0) { > conn_flag(c, conn, STALLED); > return 0; > @@ -2367,6 +2384,7 @@ static int tcp_data_from_tap(struct ctx *c, struct = tcp_tap_conn *conn, > max_ack_seq, conn->seq_to_tap); > conn->seq_ack_from_tap =3D max_ack_seq; > conn->seq_to_tap =3D max_ack_seq; > + set_peek_offset(conn->sock, 0); I don't think a zero offset is entirely correct here. Instead I think it should be (max_ack_seq - conn->seq_ack_from_tap). We're retransmitting from the segment immediately after max_ack_seq, so we want that offset, minus the offset of the the "TRUNC" pointer in the stream, which is conn->seq_ack_from_tap. Now, usually that will be 0, because tcp_sock_consume() just above will have advanced the TRUNC pointer up to max_ack_seq, so tcp_update_seqack_from_tap() will have set conn->seq_ack_from_tap equal to max_ack_seq. However, tcp_sock_consume() can fail, in which case neither the TRUNC pointer nor conn->seq_ack_from_tap will be advanced. > tcp_data_from_sock(c, conn); > } > =20 > @@ -2718,6 +2736,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_= ref ref, > tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, > s, (struct sockaddr *)&sa)) > return; > + set_peek_offset(s, 0); Since this is specific to "tap" connections, I think this would be better moved into tcp_tap_conn_from_sock(). > tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, > (struct sockaddr *)&sa, now); > @@ -3042,6 +3061,7 @@ static void tcp_sock_refill_init(const struct ctx *= c) > int tcp_init(struct ctx *c) > { > unsigned b; > + int s; > =20 > for (b =3D 0; b < TCP_HASH_TABLE_SIZE; b++) > tc_hash[b] =3D FLOW_SIDX_NONE; > @@ -3065,6 +3085,17 @@ int tcp_init(struct ctx *c) > NS_CALL(tcp_ns_socks_init, c); > } > =20 > + /* Probe for SO_PEEK_OFF support */ > + s =3D socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); > + if (s < 0) { > + perror("Temporary tcp socket creation failed\n"); > + } else { > + if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &(int){0}, sizeof(int))) { > + peek_offset_cap =3D true; > + } > + close(s); > + } > + printf("SO_PEEK_OFF%ssupported\n", peek_offset_cap ? " " : " not "); > return 0; > } I think you're also missing one call to set_peek_offset(). You've got a call for the fast re-transmit case (duplicate ack), but not for the "slow retransmit" case (ack timeout), which is in tcp_timer_handler(). *thinks* the peek pointer basically needs to always be kept in sync with conn->seq_to_tap. When seq_to_tap is advanced because we've peeked more data, the kernel should automatically update the peek pointer to match. But any other place we adjust seq_to_tap probably needs a setsockopt(). That looks to be the "slow" retransmit case in tcp_timer_handler(), but also the "ack sequence gap" bit at the top of tcp_data_from_sock(). I'm not really sure what that case is about, but I think we need to adjust SO_PEEK_OFF. I wonder if it would be safe to have a helper function that adjusts both seq_to_tap and SO_PEEK_OFF. --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --73Iue7GcYywnAtmC Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmXAdUsACgkQzQJF27ox 2GcGDhAAjKhgxO/68LmItQ9UtzDpTItZLAHfhU/1FMPGQnPlq1fSM96U8IUyo8S9 StrdQzCrVNiS4JoVLYwz4d7RwQ6KGMB9sFACl1ezZwPEf8vOJBlpAtJbB7EoA7Tu BsO2JDbpDXjg+EXLpFdwfMSdg0iZpGZql0k44lH2sY+PYKPZlW82aOC9/3/1T4LF W0i++uV8dbYUI9aobQh4nCMaA9XpZ75WBNvkvHS8CEYMiV0jLl0EpNtum+6Umv5k G6tSBl+vSlnVc4EVU06TfhouB67hn/pqm3lMXr7/7970MwdPDRxHvG61NDQGfmAS AG8yauaRnrWSq6B5CrO0yF7kW0nXVrX1sQvUviZe2A9uFcsZ6DDaQIDSnkwPdYWh qYYg8oafzoZ2K6raVNn5j5eL16be3lH/e3yOl/uC4+nI57n0J1FJLVTnXwdWxCK+ Vvq4ko14yGTfllWK8hJjt6Odx1k4Vs4qry8ZJHQE1YXgAK68AmkLzYXVIRWvovML VPyWusnk8lQpA58oOLOj6A4LN+wgk9R+Dx1m+YMNx/mjNBCim5VByiYbOw1Qc1cL j6Cm2yA4M/Ihj4ILnd3On6Jo0YA6RIDgaHXuu3GaNF/0jyYuN1Tu5hQwxVSELXUB 1YcyDJwCiOBrZOYG2pmL36K3yqPcII0UA3ecxIksWo86l6HohwI= =ivuU -----END PGP SIGNATURE----- --73Iue7GcYywnAtmC--