From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 20 May 2024 18:07:02 +1000
From: David Gibson
To: Jon Maloy
Cc: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [PATCH v6 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available
Message-ID:
References: <20240517152414.1188282-1-jmaloy@redhat.com> <20240517152414.1188282-3-jmaloy@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="rdsOfU7260shE6o3"
Content-Disposition: inline
In-Reply-To: <20240517152414.1188282-3-jmaloy@redhat.com>

--rdsOfU7260shE6o3
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, May 17, 2024 at 11:24:13AM
-0400, Jon Maloy wrote:
> From linux-6.9.0 the kernel will contain
> commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").
>
> This new feature makes it possible to call recvmsg(MSG_PEEK) and make
> it start reading data from a given offset set by the SO_PEEK_OFF socket
> option. This way, we can avoid repeated reading of already read bytes of
> a received message, hence saving read cycles when forwarding TCP
> messages in the host->namespace direction.
>
> In this commit, we add functionality to leverage this feature when
> available, while we fall back to the previous behavior when not.
>
> Measurements with iperf3 show that throughput increases by 15-20
> percent in the host->namespace direction when this feature is used.
>
> Signed-off-by: Jon Maloy

Reviewed-by: David Gibson

>
> ---
> v2: - Some smaller changes as suggested by David Gibson and Stefano Brivio.
>     - Moved initial set_peek_offset(0) to only the locations where the socket is set
>       to ESTABLISHED.
>     - Removed the per-packet synchronization between sk_peek_off and
>       already_sent. Instead only doing it in retransmit situations.
>     - The problem I found when troubleshooting the occasionally occurring
>       out-of-sync values between 'already_sent' and 'sk_peek_offset' may
>       have deeper implications that we may need to investigate.
>
> v3: - Rebased to most recent version of tcp.c, plus the previous
>       patch in this series.
>     - Some changes based on feedback from PASST team
>
> v4: - Some small changes based on feedback from Stefan/David.
>
> v5: - Re-added accidentally dropped set_peek_offset() line.
>       Thank you, David.
> ---
>  tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 51 insertions(+), 8 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 3a2350a..fa13292 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -511,6 +511,9 @@ static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
>  static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
>  static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
>
> +/* Does the kernel support TCP_PEEK_OFF? */
> +static bool peek_offset_cap;
> +
>  /* sendmsg() to socket */
>  static struct iovec tcp_iov [UIO_MAXIOV];
>
> @@ -526,6 +529,20 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
>  int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
>  int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
>
> +/**
> + * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported
> + * @s:		Socket to update
> + * @offset:	Offset in bytes
> + */
> +static void tcp_set_peek_offset(int s, int offset)
> +{
> +	if (!peek_offset_cap)
> +		return;
> +
> +	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
> +		err("Failed to set SO_PEEK_OFF to %i in socket %i", offset, s);
> +}
> +
>  /**
>   * tcp_conn_epoll_events() - epoll events mask for given connection state
>   * @events:	Current connection events
> @@ -1273,6 +1290,7 @@ static void tcp_revert_seq(struct tcp_tap_conn **conns, struct iovec (*frames)[T
>  			continue;
>
>  		conn->seq_to_tap = seq;
> +		tcp_set_peek_offset(conn->sock, seq - conn->seq_ack_from_tap);
>  	}
>  }
>
> @@ -2199,14 +2217,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  	uint32_t already_sent, seq;
>  	struct iovec *iov;
>
> +	/* How much have we read/sent since last received ack ? */
>  	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> -
>  	if (SEQ_LT(already_sent, 0)) {
>  		/* RFC 761, section 2.1. */
>  		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
>  			   conn->seq_ack_from_tap, conn->seq_to_tap);
>  		conn->seq_to_tap = conn->seq_ack_from_tap;
>  		already_sent = 0;
> +		tcp_set_peek_offset(s, 0);
>  	}
>
>  	if (!wnd_scaled || already_sent >= wnd_scaled) {
> @@ -2224,11 +2243,16 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  		iov_rem = (wnd_scaled - already_sent) % mss;
>  	}
>
> -	mh_sock.msg_iov = iov_sock;
> -	mh_sock.msg_iovlen = fill_bufs + 1;
> -
> -	iov_sock[0].iov_base = tcp_buf_discard;
> -	iov_sock[0].iov_len = already_sent;
> +	/* Prepare iov according to kernel capability */
> +	if (!peek_offset_cap) {
> +		mh_sock.msg_iov = iov_sock;
> +		iov_sock[0].iov_base = tcp_buf_discard;
> +		iov_sock[0].iov_len = already_sent;
> +		mh_sock.msg_iovlen = fill_bufs + 1;
> +	} else {
> +		mh_sock.msg_iov = &iov_sock[1];
> +		mh_sock.msg_iovlen = fill_bufs;
> +	}
>
>  	if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) ||
>  	    (!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) {
> @@ -2269,7 +2293,10 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  		return 0;
>  	}
>
> -	sendlen = len - already_sent;
> +	sendlen = len;
> +	if (!peek_offset_cap)
> +		sendlen -= already_sent;
> +
>  	if (sendlen <= 0) {
>  		conn_flag(c, conn, STALLED);
>  		return 0;
> @@ -2440,6 +2467,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			   "fast re-transmit, ACK: %u, previous sequence: %u",
>  			   max_ack_seq, conn->seq_to_tap);
>  		conn->seq_to_tap = max_ack_seq;
> +		tcp_set_peek_offset(conn->sock, 0);
>  		tcp_data_from_sock(c, conn);
>  	}
>
> @@ -2532,6 +2560,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>  	conn->seq_ack_to_tap = conn->seq_from_tap;
>
>  	conn_event(c, conn, ESTABLISHED);
> +	tcp_set_peek_offset(conn->sock, 0);
>
>  	/* The client might have sent data already, which we didn't
>  	 * dequeue waiting for SYN,ACK from tap -- check now.
> @@ -2612,6 +2641,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
>  			goto reset;
>
>  		conn_event(c, conn, ESTABLISHED);
> +		tcp_set_peek_offset(conn->sock, 0);
>
>  		if (th->fin) {
>  			conn->seq_from_tap++;
> @@ -2865,6 +2895,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  			flow_dbg(conn, "ACK timeout, retry");
>  			conn->retrans++;
>  			conn->seq_to_tap = conn->seq_ack_from_tap;
> +			tcp_set_peek_offset(conn->sock, 0);
>  			tcp_data_from_sock(c, conn);
>  			tcp_timer_ctl(c, conn);
>  		}
> @@ -3156,7 +3187,8 @@ static void tcp_sock_refill_init(const struct ctx *c)
>   */
>  int tcp_init(struct ctx *c)
>  {
> -	unsigned b;
> +	unsigned int b, optv = 0;
> +	int s;
>
>  	for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
>  		tc_hash[b] = FLOW_SIDX_NONE;
> @@ -3180,6 +3212,17 @@ int tcp_init(struct ctx *c)
>  		NS_CALL(tcp_ns_socks_init, c);
>  	}
>
> +	/* Probe for SO_PEEK_OFF support */
> +	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
> +	if (s < 0) {
> +		warn("Temporary TCP socket creation failed");
> +	} else {
> +		if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int)))
> +			peek_offset_cap = true;
> +		close(s);
> +	}
> +	info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
> +
>  	return 0;
>  }
>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--rdsOfU7260shE6o3
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmZLBKYACgkQzQJF27ox
2Gd8rQ/+Mpei4wMW2UmTwG53pentG1MUu1OGgpVlUuZeFVTxgsJ9EIVUT7TbTLWn
zkYnBVxVxuPPlrj1VfrZHHNWlaXgvnWCGHbcbATN293e3WKEVOAeE3b4Y5+3xwl7
juG6NsFWmgBmas3603p3tdiBo3ukU3jdA/HlvuF2NJagB9iaOPiXaXj2r6T4yt22
hburFEJ/fAk+bACO6kMnnAE5EuuTPDZCv8nShuqhSo8uMfQQSqUqTycbvSEyLLpb
DrKY2V3S72NYI7qdwK8zKhCsVdG582achPr51NvpqQQ55wwK8ChUwkGr2UvLIWVA
9hUktLEQjMiQrJF2QP9cIC8WJkoex9CnjZI9QbzASonSzLJ73znLl+wLeOI2epeL
sZuuT3EfTprlPS5Cf4MRWUCNuAjqqekyc6l7sKkE/5vrMfJw5h3kLgGUNXUbR8uw
uquHNIq//x/GVYEyHtBpFTGpuYyvWeYKWZ4xzPZjFrb+4m3HvOTjBkE0dyxLEKct
kU8z1fb6Unwg72+5DQcbdwJtpAR8Oz44nRXkxarf689wVLWvPXFk4z7nxPXxJfQV
/KuvXOK78t5ZEpVKBcz2+2blNNiMnXSQo+FGmI4Wzx8nZ4ANDW8mDYy4UPZwW+Kz
lKi2MdNvDmphoXk6P9wVJ/HVOH9QCS0HmQc1FvcGe45ym/HpYto=
=ucGb
-----END PGP SIGNATURE-----

--rdsOfU7260shE6o3--