From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy <jmaloy@redhat.com>
Cc: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
Subject: Re: [RFC net-next v4] tcp: add support for read with offset when using MSG_PEEK
Date: Thu, 11 Jan 2024 16:31:24 +0100 [thread overview]
Message-ID: <20240111163124.5d87f364@elisabeth> (raw)
In-Reply-To: <20240111131917.28741-1-jmaloy@redhat.com>
On Thu, 11 Jan 2024 08:19:17 -0500
Jon Maloy <jmaloy@redhat.com> wrote:
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~15 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset not supported by kernel.
>
> jmaloy@lubu:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
> [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
> [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
> [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
> [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
> [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
> [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
> [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
> [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
> [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@lubu:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@lubu:~/passt$
>
> jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset supported by kernel.
>
> jmaloy@lubu:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 40854
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.22 GBytes 10.5 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.22 GBytes 10.5 Gbits/sec
> [ 5] 3.00-4.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 4.00-5.00 sec 1.20 GBytes 10.3 Gbits/sec
> [ 5] 5.00-6.00 sec 1.14 GBytes 9.80 Gbits/sec
> [ 5] 6.00-7.00 sec 1.17 GBytes 10.0 Gbits/sec
> [ 5] 7.00-8.00 sec 1.12 GBytes 9.61 Gbits/sec
> [ 5] 8.00-9.00 sec 1.13 GBytes 9.74 Gbits/sec
> [ 5] 9.00-10.00 sec 1.26 GBytes 10.8 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 11.8 GBytes 10.1 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@lubu:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 27.24% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> v3: Made changes suggested by Stefano Brivio:
> - Added perf result to commit log
> - Separated parameter sanity tests from code logics
>
> v4: - Simplified sanity test further.
> 'iov_iter.count' is caclulated as the sum of the segment sizes in
> ___sys_recvmsg()->recvmsg_copy_msghdr()->copy_msghdr_from_user()->__import_iovec()->iov_iter_init()
> 'len' is the same as iov_iter.count, as returnrd in
> sock_recvmsg_nosec()->msg_data_left()->iov_iter_count()
> Hence, iov[0].iov_len cannot be larger than any of those, and no additional
> testing is necessary.
> - Improved description of passt/pasta in commit log
> - Some cosmetic changes to the iperf3/perf output
> ---
> net/ipv4/tcp.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ff6838ca2e58..50dc997b82f9 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2353,6 +2353,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> if (flags & MSG_PEEK) {
> peek_seq = tp->copied_seq;
> seq = &peek_seq;
> + if (!msg->msg_iter.__iov[0].iov_base) {
> + size_t peek_offset;
> +
> + if (msg->msg_iter.nr_segs < 2) {
> + err = -EINVAL;
> + goto out;
> + }
> + peek_offset = msg->msg_iter.__iov[0].iov_len;
> + msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> + msg->msg_iter.nr_segs -= 1;
> + msg->msg_iter.count -= peek_offset;
> + len -= peek_offset;
> + *seq += peek_offset;
> + }
> }
>
> target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
--
Stefano
prev parent reply other threads:[~2024-01-11 15:31 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-11 13:19 [RFC net-next v4] tcp: add support for read with offset when using MSG_PEEK Jon Maloy
2024-01-11 13:28 ` Jon Maloy
2024-01-11 15:31 ` Stefano Brivio [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240111163124.5d87f364@elisabeth \
--to=sbrivio@redhat.com \
--cc=dgibson@redhat.com \
--cc=jmaloy@redhat.com \
--cc=lvivier@redhat.com \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).