From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 1BA475A0274 for ; Wed, 24 Apr 2024 02:48:18 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1713919691; bh=sevP1U3Kfe+plahEpmNpIFDKCx65P8mgA+CVNAbxkjA=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=gwkg3xNMfYxYvsPeFlJtrYC/5VkVgvrdwrxBJqoCZ54exkjbjVQVog7raM80IZOj0 tDq0iLUV+8S9LcGvnZGCvZ+0DtK3EKacHkTvIKGbsB4LGdAECEWD1h3CZJaTMEwry2 kd2SapLU3daWfoKWTYrKELfB38K3G8p7i27irZjeJynbWRIvHBcTYviGVmfWFqiwOR gC3ZkhMpcwoiKYI0bnb4mBh7hzcSALJXU36oUsvJ5TzkuAhidEO3HIAAs146Jw5/pk SOoyd/ndc2Yy5AfwP0doggmsrGFaM7ico8uQ9yoF2NPoYQaONbpKC2M/SzkY/8IbTn 4O999s3cSJJgg== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4VPL234Djvz4wcp; Wed, 24 Apr 2024 10:48:11 +1000 (AEST) Date: Wed, 24 Apr 2024 10:44:00 +1000 From: David Gibson To: Jon Maloy Subject: Re: [PATCH 1/2] tcp: leverage support of SO_PEEK_OFF socket option when available Message-ID: References: <20240420191920.104876-1-jmaloy@redhat.com> <20240420191920.104876-2-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="GXCpIIXyQ9e4+el3" Content-Disposition: inline In-Reply-To: <20240420191920.104876-2-jmaloy@redhat.com> Message-ID-Hash: VJDISYBI7Y646NEVGFSHP6UEE23VDSLC X-Message-ID-Hash: VJDISYBI7Y646NEVGFSHP6UEE23VDSLC X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --GXCpIIXyQ9e4+el3 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Apr 20, 2024 at 03:19:19PM -0400, Jon Maloy wrote: > The kernel may support recvmsg(MSG_PEEK), starting reading data from a Not worth a respin on its own, but I think the comma above is misplaced, and for me makes the sentence much harder to read. > given offset set by the SO_PEEK_OFF socket option. This makes it > possible to avoid repeated reading of already read initial bytes of a > received message, hence saving read cycles when forwarding TCP messages > in the host->name space direction. >=20 > In this commit, we add functionality to leverage this feature when availa= ble, > while we fall back to the previous behavior when not. >=20 > Measurements with iperf3 shows that throughput increases with 15-20 perce= nt > in the host->namespace direction when this feature is used. >=20 > Signed-off-by: Jon Maloy > --- > tcp.c | 37 +++++++++++++++++++++++++++++++++++-- > 1 file changed, 35 insertions(+), 2 deletions(-) >=20 > diff --git a/tcp.c b/tcp.c > index 905d26f..95d400a 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -505,6 +505,7 @@ static struct tcp_buf_seq_update tcp6_l2_buf_seq_upda= te[TCP_FRAMES_MEM]; > static unsigned int tcp6_l2_buf_used; > =20 > /* recvmsg()/sendmsg() data for tap */ > +static bool peek_offset_cap =3D false; > static char tcp_buf_discard [MAX_WINDOW]; > static struct iovec iov_sock [TCP_FRAMES_MEM + 1]; > =20 > @@ -582,6 +583,14 @@ static_assert(ARRAY_SIZE(tc_hash) >=3D FLOW_MAX, > int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; > int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; > =20 > +static void set_peek_offset(int s, int offset) > +{ > + if (!peek_offset_cap) > + return; > + if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset))) > + perror("Failed to set SO_PEEK_OFF\n"); > +} > + > /** > * tcp_conn_epoll_events() - epoll events mask for given connection state > * @events: Current connection events > @@ -1951,7 +1960,7 @@ static void tcp_conn_from_tap(struct ctx *c, > if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) > goto cancel; > } > - > + set_peek_offset(s, 0); > conn =3D &flow->tcp; > conn->f.type =3D FLOW_TCP; > conn->sock =3D s; > @@ -2174,6 +2183,15 @@ static int tcp_data_from_sock(struct ctx *c, struc= t tcp_tap_conn *conn) > if (iov_rem) > iov_sock[fill_bufs].iov_len =3D iov_rem; > =20 > + if (peek_offset_cap) { > + /* Don't use discard buffer */ > + mh_sock.msg_iov =3D &iov_sock[1]; > + mh_sock.msg_iovlen -=3D 1; > + > + /* Keep kernel sk_peek_off in synch */ > + set_peek_offset(s, already_sent); I thought we didn't need to set SO_PEEK_OFF here - that it would track on its own, and we only needed to change it for retransmits. I don't think we even need to calculate 'already_sent' when we have SO_PEEK_OFF. In fact - if we set already_sent to 0 here, it might make things a bit cleaner than having to have special cases for adjusting the iov and sendlen. > + } > + > /* Receive into buffers, don't dequeue until acknowledged by guest. */ > do > len =3D recvmsg(s, &mh_sock, MSG_PEEK); > @@ -2195,7 +2213,9 @@ static int tcp_data_from_sock(struct ctx *c, struct= tcp_tap_conn *conn) > return 0; > } > =20 > - sendlen =3D len - already_sent; > + sendlen =3D len; > + if (!peek_offset_cap) > + sendlen -=3D already_sent; > if (sendlen <=3D 0) { > conn_flag(c, conn, STALLED); > return 0; > @@ -2718,6 +2738,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_= ref ref, > tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, > s, (struct sockaddr *)&sa)) > return; > + set_peek_offset(s, 0); > =20 > tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, > (struct sockaddr *)&sa, now); > @@ -3042,6 +3063,7 @@ static void tcp_sock_refill_init(const struct ctx *= c) > int tcp_init(struct ctx *c) > { > unsigned b; > + int s; > =20 > for (b =3D 0; b < TCP_HASH_TABLE_SIZE; b++) > tc_hash[b] =3D FLOW_SIDX_NONE; > @@ -3065,6 +3087,17 @@ int tcp_init(struct ctx *c) > NS_CALL(tcp_ns_socks_init, c); > } > =20 > + /* Probe for SO_PEEK_OFF support */ > + s =3D socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); > + if (s < 0) { > + perror("Temporary tcp socket creation failed\n"); > + } else { > + if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &(int){0}, sizeof(int))) { > + peek_offset_cap =3D true; > + } > + close(s); > + } > + printf("SO_PEEK_OFF%ssupported\n", peek_offset_cap ? " " : " not "); Should be an info(). > return 0; > } > =20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --GXCpIIXyQ9e4+el3 Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmYoVb8ACgkQzQJF27ox 2Gfu/hAAkOr7feVhGzcdc99bu7f1BqsrE1IPskn4T0EvFoIvTYJM6pM/8H0fXJXA muWktoYC3xC7YA2OXSdee2EXG0M/MPI8g9o9Jaf4ic1nR020SUG8GSOzdb+qSeil z1d7LWXQrA8/xKL2eJRxSr3yXrQA6KJKgrJfb2xqHjy21dttDWQm2ggeUArJZbci /6AQBHKMWxLhvluO7OgmAI4wbed/VEjh1k23q8Zw2SsIPH3Ypf+k6QH3vwgx5hZh Vv7Cz/5ZJS28Xl2EFnelyTu35CMY48EjbIsTT3JqKWNqU2eNZw4XDi8KIxjyZXAR 9CCDQOR6L7dGz62KgrCbN6BN5zCC4GdtjvPi6fIvDkrDmihVJr7qsZ0fSWqjfpxv j/AU0AEIBBN1cuvNFaZc4/05xY5Hz5rigw5g4HQT56h9HwW0MApuE2nCVW9uV4Fy B0MM9YOM77NzoA3Mjq7S+Uiuqdf+h4rtKdCisvZ3BdVwPqIITIegMrq6rIu5/tRF fwM/cg4WW/ZFqL1AIBqn9V+IXCZzeCb1z9CYxWmUV1osHOpBf/lX1QYo1SBy9EYn 57S8wdlBfK4UUuQ5EEdAv9uffTIpXPGj8kE/jBy228PoeDuuSa3DCUTqagUYp9xD CiLYb7mhghMSjFVcfWS/TFxTRJJX4ObmtIywrrtlt0SPE17Yhkc= =32D+ -----END PGP SIGNATURE----- --GXCpIIXyQ9e4+el3--