public inbox for passt-dev@passt.top
From: Jon Maloy <jmaloy@redhat.com>
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
	dgibson@redhat.com, jmaloy@redhat.com
Subject: [RFC net-next v3] tcp: add support for read with offset when using MSG_PEEK
Date: Mon,  8 Jan 2024 04:16:20 -0500
Message-ID: <20240108091620.786698-1-jmaloy@redhat.com>

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.
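
As a rough user-space sketch of the calling convention described above:
peek_with_offset() builds the two-entry iovec with a NULL iov_base in the
first entry. It only makes sense on a TCP socket and on a kernel carrying
this patch. peek_with_discard() is an assumed workaround for other kernels,
where the skipped bytes are copied into a scratch buffer and ignored. Both
function names are hypothetical and not part of any existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Proposed interface: a NULL iov_base in the first entry asks the socket
 * to skip iov_len bytes before copying into the second entry.
 * Returns whatever recvmsg() returns.
 */
ssize_t peek_with_offset(int fd, size_t offset, void *buf, size_t len)
{
	struct iovec iov[2] = {
		{ .iov_base = NULL, .iov_len = offset },	/* offset marker */
		{ .iov_base = buf,  .iov_len = len },
	};
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };

	return recvmsg(fd, &mh, MSG_PEEK);
}

/* Assumed workaround without offset support: peek the leading bytes into
 * a scratch buffer of at least 'offset' bytes and ignore them.
 * Returns the number of bytes that landed in buf, or -1 on error.
 */
ssize_t peek_with_discard(int fd, void *scratch, size_t offset,
			  void *buf, size_t len)
{
	struct iovec iov[2] = {
		{ .iov_base = scratch, .iov_len = offset },
		{ .iov_base = buf,     .iov_len = len },
	};
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };
	ssize_t n;

	assert(scratch != NULL || offset == 0);

	n = recvmsg(fd, &mh, MSG_PEEK);
	if (n < 0)
		return -1;

	return (size_t)n > offset ? n - (ssize_t)offset : 0;
}
```

With the patch below, peek_with_offset() fails with EINVAL when the offset
is not smaller than the total iovec length, or when the NULL entry is the
only entry.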

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~15 % (8.78 -> 10.1 Gbits/sec at the receiver) in the
direction host->namespace when using the protocol splicer 'pasta'
(https://passt.top).
This is a consistent result.

'passt' is a tool used to support VMs in containers, such as KubeVirt, and
is also generally supported in libvirt/QEMU since releases 9.2 and 7.2,
respectively.
'pasta' is the pure namespace variety of the same tool.

jmaloy@lubu:~/passt/passt$ ../passt//net/tools/perf/perf record -g ./pasta --config-net -f
MSG_PEEK with offset not supported by kernel.

root@lubu:~/passt/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
root@lubu:~/passt/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@lubu:~/passt/passt$

jmaloy@lubu:~/passt/passt$ /home/jmaloy/passt//net/tools/perf/perf record -g ./pasta --config-net -f
MSG_PEEK with offset supported by kernel.

root@lubu:~/passt/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 40854
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.22 GBytes  10.5 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.22 GBytes  10.5 Gbits/sec
[  5]   3.00-4.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   4.00-5.00   sec  1.20 GBytes  10.3 Gbits/sec
[  5]   5.00-6.00   sec  1.14 GBytes  9.80 Gbits/sec
[  5]   6.00-7.00   sec  1.17 GBytes  10.0 Gbits/sec
[  5]   7.00-8.00   sec  1.12 GBytes  9.61 Gbits/sec
[  5]   8.00-9.00   sec  1.13 GBytes  9.74 Gbits/sec
[  5]   9.00-10.00  sec  1.26 GBytes  10.8 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  11.8 GBytes  10.1 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@lubu:~/passt/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
----------------------
jmaloy@lubu:~/passt/net/tools/perf$ ./perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  /home/jmaloy/passt/passt/perf.data | head -1
    46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
jmaloy@lubu:~$

With offset support:
----------------------
jmaloy@lubu:~/passt/net/tools/perf$ ./perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  /home/jmaloy/passt/passt/perf.data | head -1
   27.24%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
jmaloy@lubu:~$
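
The figures quoted in the logs above can be turned into relative changes
with a small helper. This is only an arithmetic sketch and the function
name is hypothetical. Note that the perf numbers are shares of the samples
in each recording, not absolute CPU time.

```c
/* Relative change from 'before' to 'after', in percent. */
double rel_change_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

rel_change_pct(8.78, 10.1) gives about +15.0 for the receiver bitrate, and
rel_change_pct(46.32, 27.24) gives about -41.2 for the share of samples
attributed to ____sys_recvmsg().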

Signed-off-by: Jon Maloy <jmaloy@redhat.com>

---
v3:  Made changes suggested by Stefano Brivio:
    - Added perf result to commit log
    - Separated parameter sanity tests from code logic

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
 net/ipv4/tcp.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ff6838ca2e58..007b56dfc9e0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	size_t peek_offset;
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
@@ -2353,6 +2354,18 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			if (peek_offset >= len || msg->msg_iter.nr_segs <= 1) {
+				err = -EINVAL;
+				goto out;
+			}
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			msg->msg_iter.nr_segs -= 1;
+			msg->msg_iter.count -= peek_offset;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-- 
2.42.0
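
For readers who want to trace the arithmetic of the added hunk outside the
kernel, here is a user-space model of it. struct peek_state is a
hypothetical stand-in for the msghdr/iov_iter fields the hunk touches. It
is not kernel code and only mirrors the sanity check and the adjustments
shown in the diff.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-in for the fields modified by the hunk */
struct peek_state {
	const struct iovec *iov;	/* models msg->msg_iter.__iov */
	size_t nr_segs;			/* models msg->msg_iter.nr_segs */
	size_t count;			/* models msg->msg_iter.count */
	size_t len;			/* models the len argument */
	uint32_t seq;			/* models *seq, i.e. peek_seq */
};

/* Returns 0 on success (also when no offset is requested), or -EINVAL
 * when the parameters fail the same sanity test as in the hunk.
 */
int apply_peek_offset(struct peek_state *st)
{
	size_t peek_offset;

	if (st->iov[0].iov_base)
		return 0;		/* ordinary peek, nothing to skip */

	peek_offset = st->iov[0].iov_len;
	if (peek_offset >= st->len || st->nr_segs <= 1)
		return -EINVAL;

	st->iov = &st->iov[1];		/* drop the offset marker entry */
	st->nr_segs -= 1;
	st->count -= peek_offset;	/* marker length is not copied */
	st->len -= peek_offset;
	st->seq += (uint32_t)peek_offset; /* start reading past the offset */

	return 0;
}
```

For example, with a 4-byte offset marker followed by an 8-byte buffer
(count and len both 12), the model leaves one segment, count and len at 8,
and the sequence number advanced by 4.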


