From: Jon Maloy <jmaloy@redhat.com>
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
dgibson@redhat.com
Subject: Re: [RFC net-next v2] tcp: add support for read with offset when using MSG_PEEK
Date: Wed, 6 Dec 2023 11:48:19 -0500
Message-ID: <c36ad7d6-ff76-8ca1-609b-987dd420af5c@redhat.com>
In-Reply-To: <20231205232028.1490809-1-jmaloy@redhat.com>

Note that I only sent this one to passt-dev, not netdev.
I would appreciate feedback and possible ack/reviewed-by as soon as
possible so I can send it to netdev.

///jon

On 2023-12-05 18:20, Jon Maloy wrote:
> When reading received messages with MSG_PEEK, we sometimes have to read
> the leading bytes of the stream several times, only to reach the bytes
> we really want. This is clearly non-optimal.
>
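
For readers who have not run into the problem: a minimal sketch of the
workaround that is needed today, assuming the caller wants to peek at
bytes that sit behind an already-seen prefix. The names peek_past_prefix(),
demo_peek_past_prefix() and the 'discard' buffer are made up for
illustration; the demo uses an AF_UNIX stream socket pair only so that it
runs without a network peer.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Peek at up to 'len' bytes located 'skip' bytes into the receive queue.
 * Without an offset facility the prefix still has to be copied somewhere,
 * here into 'discard' (hypothetical name, at least 'skip' bytes long).
 * Returns the number of wanted bytes placed in 'buf', or -1 on error.
 */
static ssize_t peek_past_prefix(int fd, void *discard, size_t skip,
				void *buf, size_t len)
{
	struct iovec iov[2] = {
		{ .iov_base = discard, .iov_len = skip },
		{ .iov_base = buf,     .iov_len = len  },
	};
	struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };
	ssize_t n;

	assert(discard != NULL || skip == 0);

	n = recvmsg(fd, &msg, MSG_PEEK);	/* data stays queued */
	if (n < 0)
		return -1;
	return (size_t)n > skip ? n - (ssize_t)skip : 0;
}

/* Demo on a local stream socket pair; returns 0 on success. */
static int demo_peek_past_prefix(void)
{
	static const char data[] = "hello world";
	char discard[6], buf[6] = { 0 };
	int sv[2], ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	/* Skip "hello " (6 bytes), then look at the 5 bytes behind it. */
	if (write(sv[0], data, strlen(data)) == (ssize_t)strlen(data) &&
	    peek_past_prefix(sv[1], discard, 6, buf, 5) == 5 &&
	    memcmp(buf, "world", 5) == 0)
		ret = 0;
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

Every call copies the 'skip' prefix again, and that repeated copy is the
cost the commit message refers to.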
> What we would want is something similar to pread/preadv(), but working
> even for TCP sockets. At the same time, we don't want to add any new
> arguments to the recv/recvmsg() calls.
>
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
>
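
A sketch of how a caller might use the interface described above, assuming
a kernel carrying this RFC. The helpers peek_offset_iov() and
peek_at_offset() are hypothetical names, not part of the patch. An
unpatched kernel treats the NULL base as an ordinary buffer address, so
the call is expected to fail there whenever the offset is non-zero.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Fill a two-entry vector the way the RFC proposes: entry 0 has a NULL
 * base and carries the offset in iov_len, entry 1 is the real buffer.
 */
static void peek_offset_iov(struct iovec iov[2], size_t offset,
			    void *buf, size_t len)
{
	iov[0].iov_base = NULL;		/* marks entry 0 as "skip" */
	iov[0].iov_len  = offset;	/* number of bytes to skip */
	iov[1].iov_base = buf;
	iov[1].iov_len  = len;
}

/*
 * Peek at 'len' bytes starting 'offset' bytes into the stream.
 * Assumption: only meaningful on a kernel with the proposed change;
 * the return value is whatever recvmsg() reports.
 */
static ssize_t peek_at_offset(int fd, size_t offset, void *buf, size_t len)
{
	struct iovec iov[2];
	struct msghdr msg = { 0 };

	peek_offset_iov(iov, offset, buf, len);
	msg.msg_iov = iov;
	msg.msg_iovlen = 2;
	return recvmsg(fd, &msg, MSG_PEEK);
}
```

Compared with the discard-buffer approach, no memory is needed for the
skipped prefix, and the recvmsg() signature and flags stay unchanged.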
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~20 % (21.0 vs. 25.4 Gbits/sec at the receiver) in the
> direction host->namespace when using the protocol splicer 'passt'.
> This is a consistent result.
>
> $ ./passt/passt/pasta --config-net -f
> MSG_PEEK with offset not supported.
> [root@fedora37 ~]# perf record iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 60344
> [ 6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
> [ ID] Interval Transfer Bitrate
> [...]
> [ 6] 13.00-14.00 sec 2.54 GBytes 21.8 Gbits/sec
> [ 6] 14.00-15.00 sec 2.52 GBytes 21.7 Gbits/sec
> [ 6] 15.00-16.00 sec 2.50 GBytes 21.5 Gbits/sec
> [ 6] 16.00-17.00 sec 2.49 GBytes 21.4 Gbits/sec
> [ 6] 17.00-18.00 sec 2.51 GBytes 21.6 Gbits/sec
> [ 6] 18.00-19.00 sec 2.48 GBytes 21.3 Gbits/sec
> [ 6] 19.00-20.00 sec 2.49 GBytes 21.4 Gbits/sec
> [ 6] 20.00-20.04 sec 87.4 MBytes 19.2 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 6] 0.00-20.04 sec 48.9 GBytes 21.0 Gbits/sec receiver
> -----------------------------------------------------------
>
> [jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net -f
> MSG_PEEK with offset supported.
> [root@fedora37 ~]# perf record iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 46362
> [ 6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
> [ ID] Interval Transfer Bitrate
> [...]
> [ 6] 12.00-13.00 sec 3.18 GBytes 27.3 Gbits/sec
> [ 6] 13.00-14.00 sec 3.17 GBytes 27.3 Gbits/sec
> [ 6] 14.00-15.00 sec 3.13 GBytes 26.9 Gbits/sec
> [ 6] 15.00-16.00 sec 3.17 GBytes 27.3 Gbits/sec
> [ 6] 16.00-17.00 sec 3.17 GBytes 27.2 Gbits/sec
> [ 6] 17.00-18.00 sec 3.14 GBytes 27.0 Gbits/sec
> [ 6] 18.00-19.00 sec 3.17 GBytes 27.2 Gbits/sec
> [ 6] 19.00-20.00 sec 3.12 GBytes 26.8 Gbits/sec
> [ 6] 20.00-20.04 sec 119 MBytes 25.5 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 6] 0.00-20.04 sec 59.4 GBytes 25.4 Gbits/sec receiver
> -----------------------------------------------------------
>
> Passt is used to support VMs in containers, such as KubeVirt, and
> is also generally supported in libvirt/QEMU since release 9.2 / 7.2.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
> net/ipv4/tcp.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53bcc17c91e4..e9d3b5bf2f66 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  			      int *cmsg_flags)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> +	size_t peek_offset;
>  	int copied = 0;
>  	u32 peek_seq;
>  	u32 *seq;
> @@ -2353,6 +2354,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (flags & MSG_PEEK) {
>  		peek_seq = tp->copied_seq;
>  		seq = &peek_seq;
> +		if (!msg->msg_iter.__iov[0].iov_base) {
> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> +			if (msg->msg_iter.nr_segs <= 1)
> +				goto out;
> +			msg->msg_iter.nr_segs -= 1;
> +			if (msg->msg_iter.count <= peek_offset)
> +				goto out;
> +			msg->msg_iter.count -= peek_offset;
> +			if (len <= peek_offset)
> +				goto out;
> +			len -= peek_offset;
> +			*seq += peek_offset;
> +		}
>  	}
>
>  	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);