From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3])
	by passt.top (Postfix) with ESMTPS id B18685A02C3
	for <passt-dev@passt.top>; Mon, 13 May 2024 04:23:31 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202312; t=1715567007;
	bh=oHII2j2f6id/6J+yKnXww0p4ev+vM53RCL6eD/+ByV4=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=nQYF4liD7e2HXj2XjfUhtYY3fbgdpEeVS0xOCEX/N7q1JL40pMdGLotcGsqVgcPYp
	 pRf16GYN1d+5hk/Z8YcWAHHgqp7TVa/UpB4TuubrxBHEHKtYSYS4kaAtYWycL/KAh4
	 qwnIk74WgL+I9jeVPnCSHxzS1pPOgfmdZ8rvJmJ8nklDQabKUlVVyG5dkquLAm3Ae3
	 RDT6+psyguX0Ajf0FF0wZOFimwD2bbDYNOYiJQPRQWLKBs85tccfNUSM2N1/dl3fDn
	 xsqzxXeZvbRduX5McJQjO9W3DZB5BKtotKsRDRJmMOCKef9N8kcCKfSBSGInAnOcHv
	 4E9ss3gkmU8tA==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4Vd3FC3pBsz4wxt; Mon, 13 May 2024 12:23:27 +1000 (AEST)
Date: Mon, 13 May 2024 12:23:19 +1000
From: David Gibson <david@gibson.dropbear.id.au>
To: Jon Maloy <jmaloy@redhat.com>
Subject: Re: [PATCH v3 2/3] tcp: leverage support of SO_PEEK_OFF socket
 option when available
Message-ID: <ZkF5l6BdP7bld6xY@zatzit>
References: <20240511152008.421750-1-jmaloy@redhat.com>
 <20240511152008.421750-3-jmaloy@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="gttwQlCv6hDx5P0O"
Content-Disposition: inline
In-Reply-To: <20240511152008.421750-3-jmaloy@redhat.com>
Message-ID-Hash: RYGK7BHHS7X63HXQ2KQAAYPPGJ7X7JTU
X-Message-ID-Hash: RYGK7BHHS7X63HXQ2KQAAYPPGJ7X7JTU
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/ZkF5l6BdP7bld6xY@zatzit/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/RYGK7BHHS7X63HXQ2KQAAYPPGJ7X7JTU/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>


--gttwQlCv6hDx5P0O
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, May 11, 2024 at 11:20:07AM -0400, Jon Maloy wrote:
> >From linux-6.9.0 the kernel will contain
> commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").
>=20
> This new feature makes is possible to call recv_msg(MSG_PEEK) and make
> it start reading data from a given offset set by the SO_PEEK_OFF socket
> option. This way, we can avoid repeated reading of already read bytes of
> a received message, hence saving read cycles when forwarding TCP
> messages in the host->name space direction.
>=20
> In this commit, we add functionality to leverage this feature when
> available, while we fall back to the previous behavior when not.
>=20
> Measurements with iperf3 shows that throughput increases with 15-20
> percent in the host->namespace direction when this feature is used.
>=20
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>=20
> ---
> v2: - Some smaller changes as suggested by David Gibson and Stefano Brivi=
o.
>     - Moved initial set_peek_offset(0) to only the locations where the so=
cket is set
>       to ESTABLISHED.
>     - Removed the per-packet synchronization between sk_peek_off and
>       already_sent. Instead only doing it in retransmit situations.
>     - The problem I found when trouble shooting the occasionally occurring
>       out of synch values between 'already_sent' and 'sk_peek_offset' may
>       have deeper implications that we may need to be investigate.
>=20
> v3: - Rebased to most recent version of tcp.c, plus the previous
>       patch in this series.
>     - Some changes based on feedback from PASST team
> ---
>  tcp.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 63 insertions(+), 14 deletions(-)
>=20
> diff --git a/tcp.c b/tcp.c
> index 21cbfba..8297812 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -520,6 +520,9 @@ static struct iovec	tcp6_l2_iov		[TCP_FRAMES_MEM][TCP=
_NUM_IOVS];
>  static struct iovec	tcp4_l2_flags_iov	[TCP_FRAMES_MEM][TCP_NUM_IOVS];
>  static struct iovec	tcp6_l2_flags_iov	[TCP_FRAMES_MEM][TCP_NUM_IOVS];
> =20
> +/* Does the kernel support TCP_PEEK_OFF? */
> +static bool peek_offset_cap;
> +
>  /* sendmsg() to socket */
>  static struct iovec	tcp_iov			[UIO_MAXIOV];
> =20
> @@ -535,6 +538,20 @@ static_assert(ARRAY_SIZE(tc_hash) >=3D FLOW_MAX,
>  int init_sock_pool4		[TCP_SOCK_POOL_SIZE];
>  int init_sock_pool6		[TCP_SOCK_POOL_SIZE];
> =20
> +/**
> + * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if support=
ed
> + * @s:          Socket to update
> + * @offset:     Offset in bytes
> + */
> +static void tcp_set_peek_offset(int s, int offset)
> +{
> +	if (!peek_offset_cap)
> +		return;
> +
> +	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
> +		err("Failed to set SO_PEEK_OFF");
> +}
> +
>  /**
>   * tcp_conn_epoll_events() - epoll events mask for given connection state
>   * @events:	Current connection events
> @@ -1280,11 +1297,14 @@ static void tcp_revert_seq(struct tcp_frame_ref *=
frame_ref, int first, int last)
>  			continue;
> =20
>  		conn->seq_to_tap =3D frame_ref[i].seq;
> +		tcp_set_peek_offset(conn->sock,
> +				    conn->seq_to_tap - conn->seq_ack_from_tap);
> =20
>  		if (SEQ_GE(conn->seq_to_tap, conn->seq_ack_from_tap))
>  			continue;
> =20
>  		conn->seq_to_tap =3D conn->seq_ack_from_tap;
> +		tcp_set_peek_offset(conn->sock, 0);
>  	}
>  }
> =20
> @@ -2203,42 +2223,52 @@ static int tcp_data_from_sock(struct ctx *c, stru=
ct tcp_tap_conn *conn)
>  	uint32_t wnd_scaled =3D conn->wnd_from_tap << conn->ws_from_tap;
>  	int fill_bufs, send_bufs =3D 0, last_len, iov_rem =3D 0;
>  	int sendlen, len, dlen, v4 =3D CONN_V4(conn);
> +	uint32_t max_send, seq, already_sent;
>  	int s =3D conn->sock, i, ret =3D 0;
>  	struct msghdr mh_sock =3D { 0 };
>  	uint16_t mss =3D MSS_GET(conn);
> -	uint32_t already_sent, seq;
>  	struct iovec *iov;
> =20
> +	/* How much have we read/sent since last received ack ? */
>  	already_sent =3D conn->seq_to_tap - conn->seq_ack_from_tap;
> -
>  	if (SEQ_LT(already_sent, 0)) {
>  		/* RFC 761, section 2.1. */
>  		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
>  			   conn->seq_ack_from_tap, conn->seq_to_tap);
>  		conn->seq_to_tap =3D conn->seq_ack_from_tap;
>  		already_sent =3D 0;
> +		tcp_set_peek_offset(s, 0);
>  	}
> =20
> -	if (!wnd_scaled || already_sent >=3D wnd_scaled) {
> +	/* How much are we still allowed to send within current window ? */

Does this hunk actually related to the SO_PEEK_OFF change, or does it
belong with the zero window fix in the next patch?

> +	max_send =3D conn->seq_ack_from_tap + wnd_scaled - conn->seq_to_tap;
> +	if (SEQ_LE(max_send, 0)) {

IIUC, 'max_send' is an difference in sequence numbers, not an absolute
sequence number, and should therefore be compared with a regular <=3D
rather than SEQ_LE.  Although it looks like you'd also need to invert
the sign to make that work,

> +		flow_trace(conn, "Empty window: win: %u, sent: %u",
> +			   wnd_scaled, conn->seq_to_tap);



>  		conn_flag(c, conn, STALLED);
>  		conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  		return 0;
>  	}
> =20
> -	/* Set up buffer descriptors we'll fill completely and partially. */
> -	fill_bufs =3D DIV_ROUND_UP(wnd_scaled - already_sent, mss);
> +	/* Set up buffer descriptors to fill completely or partially. */
> +	fill_bufs =3D DIV_ROUND_UP(max_send, mss);
>  	if (fill_bufs > TCP_FRAMES) {
>  		fill_bufs =3D TCP_FRAMES;
>  		iov_rem =3D 0;
>  	} else {
> -		iov_rem =3D (wnd_scaled - already_sent) % mss;
> +		iov_rem =3D max_send % mss;
>  	}
> =20
> -	mh_sock.msg_iov =3D iov_sock;
> -	mh_sock.msg_iovlen =3D fill_bufs + 1;
> -
> -	iov_sock[0].iov_base =3D tcp_buf_discard;
> -	iov_sock[0].iov_len =3D already_sent;
> +	/* Prepare iov according to kernel capability */
> +	if (!peek_offset_cap) {
> +		mh_sock.msg_iov =3D iov_sock;
> +		iov_sock[0].iov_base =3D tcp_buf_discard;
> +		iov_sock[0].iov_len =3D already_sent;
> +		mh_sock.msg_iovlen =3D fill_bufs + 1;
> +	} else {
> +		mh_sock.msg_iov =3D &iov_sock[1];
> +		mh_sock.msg_iovlen =3D fill_bufs;
> +	}
> =20
>  	if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) ||
>  	    (!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) {
> @@ -2279,7 +2309,10 @@ static int tcp_data_from_sock(struct ctx *c, struc=
t tcp_tap_conn *conn)
>  		return 0;
>  	}
> =20
> -	sendlen =3D len - already_sent;
> +	sendlen =3D len;
> +	if (!peek_offset_cap)
> +		sendlen -=3D already_sent;
> +
>  	if (sendlen <=3D 0) {
>  		conn_flag(c, conn, STALLED);
>  		return 0;
> @@ -2449,7 +2482,9 @@ static int tcp_data_from_tap(struct ctx *c, struct =
tcp_tap_conn *conn,
>  		flow_trace(conn,
>  			   "fast re-transmit, ACK: %u, previous sequence: %u",
>  			   max_ack_seq, conn->seq_to_tap);
> +
>  		conn->seq_to_tap =3D max_ack_seq;
> +		tcp_set_peek_offset(conn->sock, 0);
>  		tcp_data_from_sock(c, conn);
>  	}
> =20
> @@ -2542,6 +2577,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c=
, struct tcp_tap_conn *conn,
>  	conn->seq_ack_to_tap =3D conn->seq_from_tap;
> =20
>  	conn_event(c, conn, ESTABLISHED);
> +	tcp_set_peek_offset(conn->sock, 0);
> =20
>  	/* The client might have sent data already, which we didn't
>  	 * dequeue waiting for SYN,ACK from tap -- check now.
> @@ -2622,6 +2658,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_=
family_t af,
>  			goto reset;
> =20
>  		conn_event(c, conn, ESTABLISHED);
> +		tcp_set_peek_offset(conn->sock, 0);
> =20
>  		if (th->fin) {
>  			conn->seq_from_tap++;
> @@ -2788,7 +2825,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_=
ref ref,
>  	union sockaddr_inany sa;
>  	socklen_t sl =3D sizeof(sa);
>  	union flow *flow;
> -	int s;
> +	int s =3D 0;

This appears unrelated to the rest.  Also wrong, since stdin is not a
reasonable initial value for this fd.

>  	if (c->no_tcp || !(flow =3D flow_alloc()))
>  		return;
> @@ -2875,6 +2912,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_r=
ef ref)
>  			flow_dbg(conn, "ACK timeout, retry");
>  			conn->retrans++;
>  			conn->seq_to_tap =3D conn->seq_ack_from_tap;
> +			tcp_set_peek_offset(conn->sock, 0);
>  			tcp_data_from_sock(c, conn);
>  			tcp_timer_ctl(c, conn);
>  		}
> @@ -3166,7 +3204,8 @@ static void tcp_sock_refill_init(const struct ctx *=
c)
>   */
>  int tcp_init(struct ctx *c)
>  {
> -	unsigned b;
> +	unsigned int b, optv =3D 0;
> +	int s;
> =20
>  	for (b =3D 0; b < TCP_HASH_TABLE_SIZE; b++)
>  		tc_hash[b] =3D FLOW_SIDX_NONE;
> @@ -3190,6 +3229,16 @@ int tcp_init(struct ctx *c)
>  		NS_CALL(tcp_ns_socks_init, c);
>  	}
> =20
> +	/* Probe for SO_PEEK_OFF support */
> +	s =3D socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
> +	if (s < 0) {
> +		warn("Temporary TCP socket creation failed");
> +	} else {
> +		if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int)))
> +			peek_offset_cap =3D true;
> +		close(s);
> +	}
> +	info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
>  	return 0;
>  }
> =20

--=20
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--gttwQlCv6hDx5P0O
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmZBeZYACgkQzQJF27ox
2GcWqg/+LqzdIjyU606EJL67g5cqC7VEScZCSe+3tYVxbToTm8d2RSt1+hsFXtqF
nZwGL6cV35/5Gl/4suj1Lm9o5TmqIt1p1d5/g/80qW4YG8mXtwwtNTNGBx2atFWk
4zhYcrqPwLW1Zz/AOSW+xg5X7lG+rV9j5xqZtONw5/ckhr6I6G6X3fN6jidu0f7B
jfvTzNXo4kFhja/qZKJpILhHA4KVmlWsSETWH1hwOPXFIxzgl8nT/K+dmK4KXQ7F
n6ho+CruB6bTtHFUZ6ixIX7Aw1eCdEyE+ElVkjIEAT83uniuZxcSjZnGveqx2GEA
qcZiuZ8xA7G0bU5+Y0fSUl4LpFKXL7e7fM/+/RymxPcxbdYLq3yDksaOThqFbGxr
/JHiO9Q5sFgTvmDxv16auOV9qdLUitxqC7MDGYG+Fx49nNSj4wFzA1FcO/iClZAE
E+eCSSfe3cP+Ib7maLB9/5Sx14N5LIDHsyYTN2fKa9DG1RRQCjEHUaeBk/XhZoDP
vkThISYW919GoXtVXMZAWomtoBcK1le/rWW5VLZv4mhlEn5SvybisTEmKSdElmT6
vmMNA4J4LOSlUKyGZ+jgTLVqsKXjQfeWkKPuURH6vOSBLmWhR7Hv1vpMxIUDEjvl
ge5VwsfZjU2JXhTpx4NufPTWP86jU/a07sGcxx22Ejt41/H4glk=
=+kNv
-----END PGP SIGNATURE-----

--gttwQlCv6hDx5P0O--