From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id B18685A02C3 for ; Mon, 13 May 2024 04:23:31 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1715567007; bh=oHII2j2f6id/6J+yKnXww0p4ev+vM53RCL6eD/+ByV4=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=nQYF4liD7e2HXj2XjfUhtYY3fbgdpEeVS0xOCEX/N7q1JL40pMdGLotcGsqVgcPYp pRf16GYN1d+5hk/Z8YcWAHHgqp7TVa/UpB4TuubrxBHEHKtYSYS4kaAtYWycL/KAh4 qwnIk74WgL+I9jeVPnCSHxzS1pPOgfmdZ8rvJmJ8nklDQabKUlVVyG5dkquLAm3Ae3 RDT6+psyguX0Ajf0FF0wZOFimwD2bbDYNOYiJQPRQWLKBs85tccfNUSM2N1/dl3fDn xsqzxXeZvbRduX5McJQjO9W3DZB5BKtotKsRDRJmMOCKef9N8kcCKfSBSGInAnOcHv 4E9ss3gkmU8tA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4Vd3FC3pBsz4wxt; Mon, 13 May 2024 12:23:27 +1000 (AEST) Date: Mon, 13 May 2024 12:23:19 +1000 From: David Gibson To: Jon Maloy Subject: Re: [PATCH v3 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Message-ID: References: <20240511152008.421750-1-jmaloy@redhat.com> <20240511152008.421750-3-jmaloy@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="gttwQlCv6hDx5P0O" Content-Disposition: inline In-Reply-To: <20240511152008.421750-3-jmaloy@redhat.com> Message-ID-Hash: RYGK7BHHS7X63HXQ2KQAAYPPGJ7X7JTU X-Message-ID-Hash: RYGK7BHHS7X63HXQ2KQAAYPPGJ7X7JTU X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --gttwQlCv6hDx5P0O Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, May 11, 2024 at 11:20:07AM -0400, Jon Maloy wrote: > >From linux-6.9.0 the kernel will contain > commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"). >=20 > This new feature makes is possible to call recv_msg(MSG_PEEK) and make > it start reading data from a given offset set by the SO_PEEK_OFF socket > option. This way, we can avoid repeated reading of already read bytes of > a received message, hence saving read cycles when forwarding TCP > messages in the host->name space direction. >=20 > In this commit, we add functionality to leverage this feature when > available, while we fall back to the previous behavior when not. >=20 > Measurements with iperf3 shows that throughput increases with 15-20 > percent in the host->namespace direction when this feature is used. >=20 > Signed-off-by: Jon Maloy >=20 > --- > v2: - Some smaller changes as suggested by David Gibson and Stefano Brivi= o. > - Moved initial set_peek_offset(0) to only the locations where the so= cket is set > to ESTABLISHED. > - Removed the per-packet synchronization between sk_peek_off and > already_sent. Instead only doing it in retransmit situations. > - The problem I found when trouble shooting the occasionally occurring > out of synch values between 'already_sent' and 'sk_peek_offset' may > have deeper implications that we may need to be investigate. >=20 > v3: - Rebased to most recent version of tcp.c, plus the previous > patch in this series. > - Some changes based on feedback from PASST team > --- > tcp.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++----------- > 1 file changed, 63 insertions(+), 14 deletions(-) >=20 > diff --git a/tcp.c b/tcp.c > index 21cbfba..8297812 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -520,6 +520,9 @@ static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP= _NUM_IOVS]; > static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS]; > static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS]; > =20 > +/* Does the kernel support TCP_PEEK_OFF? */ > +static bool peek_offset_cap; > + > /* sendmsg() to socket */ > static struct iovec tcp_iov [UIO_MAXIOV]; > =20 > @@ -535,6 +538,20 @@ static_assert(ARRAY_SIZE(tc_hash) >=3D FLOW_MAX, > int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; > int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; > =20 > +/** > + * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if support= ed > + * @s: Socket to update > + * @offset: Offset in bytes > + */ > +static void tcp_set_peek_offset(int s, int offset) > +{ > + if (!peek_offset_cap) > + return; > + > + if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset))) > + err("Failed to set SO_PEEK_OFF"); > +} > + > /** > * tcp_conn_epoll_events() - epoll events mask for given connection state > * @events: Current connection events > @@ -1280,11 +1297,14 @@ static void tcp_revert_seq(struct tcp_frame_ref *= frame_ref, int first, int last) > continue; > =20 > conn->seq_to_tap =3D frame_ref[i].seq; > + tcp_set_peek_offset(conn->sock, > + conn->seq_to_tap - conn->seq_ack_from_tap); > =20 > if (SEQ_GE(conn->seq_to_tap, conn->seq_ack_from_tap)) > continue; > =20 > conn->seq_to_tap =3D conn->seq_ack_from_tap; > + tcp_set_peek_offset(conn->sock, 0); > } > } > =20 > @@ -2203,42 +2223,52 @@ static int tcp_data_from_sock(struct ctx *c, stru= ct tcp_tap_conn *conn) > uint32_t wnd_scaled =3D conn->wnd_from_tap << conn->ws_from_tap; > int fill_bufs, send_bufs =3D 0, last_len, iov_rem =3D 0; > int sendlen, len, dlen, v4 =3D CONN_V4(conn); > + uint32_t max_send, seq, already_sent; > int s =3D conn->sock, i, ret =3D 0; > struct msghdr mh_sock =3D { 0 }; > uint16_t mss =3D MSS_GET(conn); > - uint32_t already_sent, seq; > struct iovec *iov; > =20 > + /* How much have we read/sent since last received ack ? */ > already_sent =3D conn->seq_to_tap - conn->seq_ack_from_tap; > - > if (SEQ_LT(already_sent, 0)) { > /* RFC 761, section 2.1. */ > flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u", > conn->seq_ack_from_tap, conn->seq_to_tap); > conn->seq_to_tap =3D conn->seq_ack_from_tap; > already_sent =3D 0; > + tcp_set_peek_offset(s, 0); > } > =20 > - if (!wnd_scaled || already_sent >=3D wnd_scaled) { > + /* How much are we still allowed to send within current window ? */ Does this hunk actually related to the SO_PEEK_OFF change, or does it belong with the zero window fix in the next patch? > + max_send =3D conn->seq_ack_from_tap + wnd_scaled - conn->seq_to_tap; > + if (SEQ_LE(max_send, 0)) { IIUC, 'max_send' is an difference in sequence numbers, not an absolute sequence number, and should therefore be compared with a regular <=3D rather than SEQ_LE. Although it looks like you'd also need to invert the sign to make that work, > + flow_trace(conn, "Empty window: win: %u, sent: %u", > + wnd_scaled, conn->seq_to_tap); > conn_flag(c, conn, STALLED); > conn_flag(c, conn, ACK_FROM_TAP_DUE); > return 0; > } > =20 > - /* Set up buffer descriptors we'll fill completely and partially. */ > - fill_bufs =3D DIV_ROUND_UP(wnd_scaled - already_sent, mss); > + /* Set up buffer descriptors to fill completely or partially. */ > + fill_bufs =3D DIV_ROUND_UP(max_send, mss); > if (fill_bufs > TCP_FRAMES) { > fill_bufs =3D TCP_FRAMES; > iov_rem =3D 0; > } else { > - iov_rem =3D (wnd_scaled - already_sent) % mss; > + iov_rem =3D max_send % mss; > } > =20 > - mh_sock.msg_iov =3D iov_sock; > - mh_sock.msg_iovlen =3D fill_bufs + 1; > - > - iov_sock[0].iov_base =3D tcp_buf_discard; > - iov_sock[0].iov_len =3D already_sent; > + /* Prepare iov according to kernel capability */ > + if (!peek_offset_cap) { > + mh_sock.msg_iov =3D iov_sock; > + iov_sock[0].iov_base =3D tcp_buf_discard; > + iov_sock[0].iov_len =3D already_sent; > + mh_sock.msg_iovlen =3D fill_bufs + 1; > + } else { > + mh_sock.msg_iov =3D &iov_sock[1]; > + mh_sock.msg_iovlen =3D fill_bufs; > + } > =20 > if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) || > (!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) { > @@ -2279,7 +2309,10 @@ static int tcp_data_from_sock(struct ctx *c, struc= t tcp_tap_conn *conn) > return 0; > } > =20 > - sendlen =3D len - already_sent; > + sendlen =3D len; > + if (!peek_offset_cap) > + sendlen -=3D already_sent; > + > if (sendlen <=3D 0) { > conn_flag(c, conn, STALLED); > return 0; > @@ -2449,7 +2482,9 @@ static int tcp_data_from_tap(struct ctx *c, struct = tcp_tap_conn *conn, > flow_trace(conn, > "fast re-transmit, ACK: %u, previous sequence: %u", > max_ack_seq, conn->seq_to_tap); > + > conn->seq_to_tap =3D max_ack_seq; > + tcp_set_peek_offset(conn->sock, 0); > tcp_data_from_sock(c, conn); > } > =20 > @@ -2542,6 +2577,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c= , struct tcp_tap_conn *conn, > conn->seq_ack_to_tap =3D conn->seq_from_tap; > =20 > conn_event(c, conn, ESTABLISHED); > + tcp_set_peek_offset(conn->sock, 0); > =20 > /* The client might have sent data already, which we didn't > * dequeue waiting for SYN,ACK from tap -- check now. > @@ -2622,6 +2658,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_= family_t af, > goto reset; > =20 > conn_event(c, conn, ESTABLISHED); > + tcp_set_peek_offset(conn->sock, 0); > =20 > if (th->fin) { > conn->seq_from_tap++; > @@ -2788,7 +2825,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_= ref ref, > union sockaddr_inany sa; > socklen_t sl =3D sizeof(sa); > union flow *flow; > - int s; > + int s =3D 0; This appears unrelated to the rest. Also wrong, since stdin is not a reasonable initial value for this fd. > if (c->no_tcp || !(flow =3D flow_alloc())) > return; > @@ -2875,6 +2912,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_r= ef ref) > flow_dbg(conn, "ACK timeout, retry"); > conn->retrans++; > conn->seq_to_tap =3D conn->seq_ack_from_tap; > + tcp_set_peek_offset(conn->sock, 0); > tcp_data_from_sock(c, conn); > tcp_timer_ctl(c, conn); > } > @@ -3166,7 +3204,8 @@ static void tcp_sock_refill_init(const struct ctx *= c) > */ > int tcp_init(struct ctx *c) > { > - unsigned b; > + unsigned int b, optv =3D 0; > + int s; > =20 > for (b =3D 0; b < TCP_HASH_TABLE_SIZE; b++) > tc_hash[b] =3D FLOW_SIDX_NONE; > @@ -3190,6 +3229,16 @@ int tcp_init(struct ctx *c) > NS_CALL(tcp_ns_socks_init, c); > } > =20 > + /* Probe for SO_PEEK_OFF support */ > + s =3D socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP); > + if (s < 0) { > + warn("Temporary TCP socket creation failed"); > + } else { > + if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int))) > + peek_offset_cap =3D true; > + close(s); > + } > + info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); > return 0; > } > =20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --gttwQlCv6hDx5P0O Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmZBeZYACgkQzQJF27ox 2GcWqg/+LqzdIjyU606EJL67g5cqC7VEScZCSe+3tYVxbToTm8d2RSt1+hsFXtqF nZwGL6cV35/5Gl/4suj1Lm9o5TmqIt1p1d5/g/80qW4YG8mXtwwtNTNGBx2atFWk 4zhYcrqPwLW1Zz/AOSW+xg5X7lG+rV9j5xqZtONw5/ckhr6I6G6X3fN6jidu0f7B jfvTzNXo4kFhja/qZKJpILhHA4KVmlWsSETWH1hwOPXFIxzgl8nT/K+dmK4KXQ7F n6ho+CruB6bTtHFUZ6ixIX7Aw1eCdEyE+ElVkjIEAT83uniuZxcSjZnGveqx2GEA qcZiuZ8xA7G0bU5+Y0fSUl4LpFKXL7e7fM/+/RymxPcxbdYLq3yDksaOThqFbGxr /JHiO9Q5sFgTvmDxv16auOV9qdLUitxqC7MDGYG+Fx49nNSj4wFzA1FcO/iClZAE E+eCSSfe3cP+Ib7maLB9/5Sx14N5LIDHsyYTN2fKa9DG1RRQCjEHUaeBk/XhZoDP vkThISYW919GoXtVXMZAWomtoBcK1le/rWW5VLZv4mhlEn5SvybisTEmKSdElmT6 vmMNA4J4LOSlUKyGZ+jgTLVqsKXjQfeWkKPuURH6vOSBLmWhR7Hv1vpMxIUDEjvl ge5VwsfZjU2JXhTpx4NufPTWP86jU/a07sGcxx22Ejt41/H4glk= =+kNv -----END PGP SIGNATURE----- --gttwQlCv6hDx5P0O--