From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 59D665A02CF for ; Thu, 16 May 2024 04:30:05 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1715826600; bh=0+zkGlLyginUF4cHc06MZBO+2jdCGkLRNQRHiQQP3zk=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=H1w5yER/ct5D1ZWGTj400xxxXDz5KbTU7hugoBXm0SzSaRaDpoBfBLuurgnJ4VYDp hJ2FEAAjkIc1vzI5Q43rCvIA170HOOBFfJ+bF9bvb9YA+nmuLDd+oNhBlaoRvuWlXi AmV0me/yMl3o+snwpefRJqEtRa3tc+V90j+bk6tHadtySK8cqNoQzF2RtZHEuga12z vTzWQJpZmGAkBrIOcPO64v2P5LR0Dm8oFd5sJ+8drbqHNG4xSY0x+0pGfgivz4yB9W rZrmFsdg2Xuksq5O5gO2OY0MeIgJ8206b5MF6dKM8qEE6O16uJx9uiJ+oxB0D1onYb nsOcr2u5Z+gzA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4VfvFN3RhQz4wx6; Thu, 16 May 2024 12:30:00 +1000 (AEST) Date: Thu, 16 May 2024 12:29:57 +1000 From: David Gibson To: Jon Maloy Subject: Re: [PATCH v4 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Message-ID: References: <20240515153429.859185-1-jmaloy@redhat.com> <20240515153429.859185-3-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="Yd7/lGpKXd6hhASv" Content-Disposition: inline In-Reply-To: <20240515153429.859185-3-jmaloy@redhat.com> Message-ID-Hash: JO5XZDOMXNX442BO4E5S6BIBIIJJEHIJ X-Message-ID-Hash: JO5XZDOMXNX442BO4E5S6BIBIIJJEHIJ X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --Yd7/lGpKXd6hhASv Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, May 15, 2024 at 11:34:28AM -0400, Jon Maloy wrote: > >From linux-6.9.0 the kernel will contain > commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"). >=20 > This new feature makes is possible to call recv_msg(MSG_PEEK) and make > it start reading data from a given offset set by the SO_PEEK_OFF socket > option. This way, we can avoid repeated reading of already read bytes of > a received message, hence saving read cycles when forwarding TCP > messages in the host->name space direction. >=20 > In this commit, we add functionality to leverage this feature when > available, while we fall back to the previous behavior when not. >=20 > Measurements with iperf3 shows that throughput increases with 15-20 > percent in the host->namespace direction when this feature is used. >=20 > Signed-off-by: Jon Maloy I'm pretty sure this needs one more call to tcp_set_peek_offset() inside the revert function introduced in the previous patch. Otherwise looks good. >=20 > --- > v2: - Some smaller changes as suggested by David Gibson and Stefano Brivi= o. > - Moved initial set_peek_offset(0) to only the locations where the so= cket is set > to ESTABLISHED. > - Removed the per-packet synchronization between sk_peek_off and > already_sent. Instead only doing it in retransmit situations. > - The problem I found when trouble shooting the occasionally occurring > out of synch values between 'already_sent' and 'sk_peek_offset' may > have deeper implications that we may need to be investigate. >=20 > v3: - Rebased to most recent version of tcp.c, plus the previous > patch in this series. > - Some changes based on feedback from PASST team >=20 > v4: - Some small changes based on feedback from Stefan/David. > --- > tcp.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 50 insertions(+), 8 deletions(-) >=20 > diff --git a/tcp.c b/tcp.c > index 976dba8..4163bf9 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -511,6 +511,9 @@ static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP= _NUM_IOVS]; > static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS]; > static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS]; > =20 > +/* Does the kernel support TCP_PEEK_OFF? */ > +static bool peek_offset_cap; > + > /* sendmsg() to socket */ > static struct iovec tcp_iov [UIO_MAXIOV]; > =20 > @@ -526,6 +529,20 @@ static_assert(ARRAY_SIZE(tc_hash) >=3D FLOW_MAX, > int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; > int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; > =20 > +/** > + * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if support= ed > + * @s: Socket to update > + * @offset: Offset in bytes > + */ > +static void tcp_set_peek_offset(int s, int offset) > +{ > + if (!peek_offset_cap) > + return; > + > + if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset))) > + err("Failed to set SO_PEEK_OFF to %u in socket %i", offset, s); > +} > + > /** > * tcp_conn_epoll_events() - epoll events mask for given connection state > * @events: Current connection events > @@ -2197,14 +2214,15 @@ static int tcp_data_from_sock(struct ctx *c, stru= ct tcp_tap_conn *conn) > uint32_t already_sent, seq; > struct iovec *iov; > =20 > + /* How much have we read/sent since last received ack ? */ > already_sent =3D conn->seq_to_tap - conn->seq_ack_from_tap; > - > if (SEQ_LT(already_sent, 0)) { > /* RFC 761, section 2.1. */ > flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u", > conn->seq_ack_from_tap, conn->seq_to_tap); > conn->seq_to_tap =3D conn->seq_ack_from_tap; > already_sent =3D 0; > + tcp_set_peek_offset(s, 0); > } > =20 > if (!wnd_scaled || already_sent >=3D wnd_scaled) { > @@ -2222,11 +2240,16 @@ static int tcp_data_from_sock(struct ctx *c, stru= ct tcp_tap_conn *conn) > iov_rem =3D (wnd_scaled - already_sent) % mss; > } > =20 > - mh_sock.msg_iov =3D iov_sock; > - mh_sock.msg_iovlen =3D fill_bufs + 1; > - > - iov_sock[0].iov_base =3D tcp_buf_discard; > - iov_sock[0].iov_len =3D already_sent; > + /* Prepare iov according to kernel capability */ > + if (!peek_offset_cap) { > + mh_sock.msg_iov =3D iov_sock; > + iov_sock[0].iov_base =3D tcp_buf_discard; > + iov_sock[0].iov_len =3D already_sent; > + mh_sock.msg_iovlen =3D fill_bufs + 1; > + } else { > + mh_sock.msg_iov =3D &iov_sock[1]; > + mh_sock.msg_iovlen =3D fill_bufs; > + } > =20 > if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) || > (!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) { > @@ -2267,7 +2290,10 @@ static int tcp_data_from_sock(struct ctx *c, struc= t tcp_tap_conn *conn) > return 0; > } > =20 > - sendlen =3D len - already_sent; > + sendlen =3D len; > + if (!peek_offset_cap) > + sendlen -=3D already_sent; > + > if (sendlen <=3D 0) { > conn_flag(c, conn, STALLED); > return 0; > @@ -2438,6 +2464,7 @@ static int tcp_data_from_tap(struct ctx *c, struct = tcp_tap_conn *conn, > "fast re-transmit, ACK: %u, previous sequence: %u", > max_ack_seq, conn->seq_to_tap); > conn->seq_to_tap =3D max_ack_seq; > + tcp_set_peek_offset(conn->sock, 0); > tcp_data_from_sock(c, conn); > } > =20 > @@ -2530,6 +2557,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c= , struct tcp_tap_conn *conn, > conn->seq_ack_to_tap =3D conn->seq_from_tap; > =20 > conn_event(c, conn, ESTABLISHED); > + tcp_set_peek_offset(conn->sock, 0); > =20 > /* The client might have sent data already, which we didn't > * dequeue waiting for SYN,ACK from tap -- check now. > @@ -2610,6 +2638,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_= family_t af, > goto reset; > =20 > conn_event(c, conn, ESTABLISHED); > + tcp_set_peek_offset(conn->sock, 0); > =20 > if (th->fin) { > conn->seq_from_tap++; > @@ -2863,6 +2892,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_r= ef ref) > flow_dbg(conn, "ACK timeout, retry"); > conn->retrans++; > conn->seq_to_tap =3D conn->seq_ack_from_tap; > + tcp_set_peek_offset(conn->sock, 0); > tcp_data_from_sock(c, conn); > tcp_timer_ctl(c, conn); > } > @@ -3154,7 +3184,8 @@ static void tcp_sock_refill_init(const struct ctx *= c) > */ > int tcp_init(struct ctx *c) > { > - unsigned b; > + unsigned int b, optv =3D 0; > + int s; > =20 > for (b =3D 0; b < TCP_HASH_TABLE_SIZE; b++) > tc_hash[b] =3D FLOW_SIDX_NONE; > @@ -3178,6 +3209,17 @@ int tcp_init(struct ctx *c) > NS_CALL(tcp_ns_socks_init, c); > } > =20 > + /* Probe for SO_PEEK_OFF support */ > + s =3D socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP); > + if (s < 0) { > + warn("Temporary TCP socket creation failed"); > + } else { > + if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int))) > + peek_offset_cap =3D true; > + close(s); > + } > + info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); > + > return 0; > } > =20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --Yd7/lGpKXd6hhASv Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmZFb6QACgkQzQJF27ox 2GdgBQ//dkiteUH52y7YZyjkM+Hz3m4UwXodv+3sNhD4YL73Z5TpswnE8x5bhnsO +rFdVneVXTX10sAMpuGHd5QyS56YWoF1y8gW5zvpv7e4VK2SuVoHY6RPmQty6OnP kdO/Ub6OrHaS0PzYXeEIhMDWcg0DDynEFdQqHcRfxssYc3XXFu4a5DrVez4jomAT +yFwupwaHj/rjTs6qrUOcq8UwtgedoXTje6Ec5lbLDUC78xj+09te3oTvL5R+eRO MeoJINHd1rHBQGIqIJuhYDB2liGhdJMdG639evQ6cWIg9sr2HTjJlvO+N/GrEM93 vV5zaOORomr87j6IntlIjgE4KfLVbi0AqMke5w1HMJe+HRsGyii5eo8Jmma4CE0s kJmvadWHF/VoEydxFEkot2xIwnZ+f7U1DHFbJkf0iLiOWLNXR7yhAs4xSzHK7sjD 6/0+vogftDLTmxxqeFn6y+kf3lgPwLktzH5KAheQWqkvLIg7aM+9XXgpRvjv20I2 72nRsBt097IZLVWSjlhthPGob6gsORY/jC8wcl6rwQkzzPT61a/PwNLt2MqjVFVL xX6ynPp0r76JA6hA5X7saXoG0gbXp/1iIBp6K7Ypre8vbJxrFgvYCfFnEwARijDJ 8t4L8XUyqDsWxX2QIXdK/ENywb3/SnaYmU4ajPqGjoOz694WG5A= =2S0y -----END PGP SIGNATURE----- --Yd7/lGpKXd6hhASv--