From: Paolo Abeni <pabeni@redhat.com>
To: Eric Dumazet <edumazet@google.com>
Cc: kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com,
lvivier@redhat.com, dgibson@redhat.com, jmaloy@redhat.com,
netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: [PATCH v3] tcp: add support for SO_PEEK_OFF
Date: Tue, 13 Feb 2024 11:49:14 +0100 [thread overview]
Message-ID: <8d77d8a4e6a37e80aa46cd8df98de84714c384a5.camel@redhat.com> (raw)
In-Reply-To: <20240209221233.3150253-1-jmaloy@redhat.com>
Oops,
I just noticed Eric is missing from the recipients list, adding him
now.
On Fri, 2024-02-09 at 17:12 -0500, jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF not supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
> [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
> [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
> [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
> [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
> [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
> [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
> [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
> [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
> [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@freyr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@freyr:~/passt$
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec
> [ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec
> [ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec
> [ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec
> [ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec
> [ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec
> [ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec
> [ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec
> [ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
> -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
> -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
> ---
> net/ipv4/af_inet.c | 1 +
> net/ipv4/tcp.c | 16 ++++++++++------
> 2 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 4e635dd3d3c8..5f0e5d10c416 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1071,6 +1071,7 @@ const struct proto_ops inet_stream_ops = {
> #endif
> .splice_eof = inet_splice_eof,
> .splice_read = tcp_splice_read,
> + .set_peek_off = sk_set_peek_off,
> .read_sock = tcp_read_sock,
> .read_skb = tcp_read_skb,
> .sendmsg_locked = tcp_sendmsg_locked,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 7e2481b9eae1..1c8cab14a32c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1415,8 +1415,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
> struct sk_buff *skb;
> int copied = 0, err = 0;
>
> - /* XXX -- need to support SO_PEEK_OFF */
> -
> skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
> if (err)
> @@ -2327,6 +2325,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> int target; /* Read at least this many bytes */
> long timeo;
> struct sk_buff *skb, *last;
> + u32 peek_offset = 0;
> u32 urg_hole = 0;
>
> err = -ENOTCONN;
> @@ -2360,7 +2359,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>
> seq = &tp->copied_seq;
> if (flags & MSG_PEEK) {
> - peek_seq = tp->copied_seq;
> + peek_offset = max(sk_peek_offset(sk, flags), 0);
> + peek_seq = tp->copied_seq + peek_offset;
> seq = &peek_seq;
> }
>
> @@ -2463,11 +2463,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> }
>
> if ((flags & MSG_PEEK) &&
> - (peek_seq - copied - urg_hole != tp->copied_seq)) {
> + (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
> net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
> current->comm,
> task_pid_nr(current));
> - peek_seq = tp->copied_seq;
> + peek_seq = tp->copied_seq + peek_offset;
> }
> continue;
>
> @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> WRITE_ONCE(*seq, *seq + used);
> copied += used;
> len -= used;
> -
> + if (flags & MSG_PEEK)
> + sk_peek_offset_fwd(sk, used);
> + else
> + sk_peek_offset_bwd(sk, used);
> tcp_rcv_space_adjust(sk);
>
> skip_copy:
> @@ -3007,6 +3010,7 @@ int tcp_disconnect(struct sock *sk, int flags)
> __skb_queue_purge(&sk->sk_receive_queue);
> WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
> WRITE_ONCE(tp->urg_data, 0);
> + sk_set_peek_off(sk, -1);
> tcp_write_queue_purge(sk);
> tcp_fastopen_active_disable_ofo_check(sk);
> skb_rbtree_purge(&tp->out_of_order_queue);
next prev parent reply other threads:[~2024-02-13 10:49 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-09 22:12 [PATCH v3] tcp: add support for SO_PEEK_OFF jmaloy
2024-02-11 23:17 ` Stefano Brivio
2024-02-13 10:49 ` Paolo Abeni [this message]
2024-02-13 12:24 ` Eric Dumazet
2024-02-13 13:02 ` Paolo Abeni
2024-02-13 13:34 ` Eric Dumazet
2024-02-13 15:28 ` Paolo Abeni
2024-02-13 15:49 ` Eric Dumazet
2024-02-13 18:39 ` Paolo Abeni
2024-02-13 19:31 ` Eric Dumazet
[not found] ` <20687849-ec5c-9ce5-0a18-cc80f5b64816@redhat.com>
2024-02-15 17:41 ` Paolo Abeni
2024-02-15 17:46 ` Eric Dumazet
2024-02-15 22:24 ` Jon Maloy
2024-02-16 9:14 ` Paolo Abeni
2024-02-16 9:21 ` Eric Dumazet
[not found] ` <6a9f5dec-eb0c-51ef-0911-7345f50e08f0@redhat.com>
2024-02-16 10:55 ` Eric Dumazet
2024-02-19 2:02 ` David Gibson
2024-02-13 23:34 ` David Gibson
2024-02-14 3:41 ` Eric Dumazet
2024-02-15 3:16 ` David Gibson
2024-02-15 3:21 ` David Gibson
2024-02-15 9:15 ` Eric Dumazet
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8d77d8a4e6a37e80aa46cd8df98de84714c384a5.camel@redhat.com \
--to=pabeni@redhat.com \
--cc=davem@davemloft.net \
--cc=dgibson@redhat.com \
--cc=edumazet@google.com \
--cc=jmaloy@redhat.com \
--cc=kuba@kernel.org \
--cc=lvivier@redhat.com \
--cc=netdev@vger.kernel.org \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).