* [RFC net-next v4] tcp: add support for read with offset when using MSG_PEEK
@ 2024-01-11 13:19 Jon Maloy
2024-01-11 13:28 ` Jon Maloy
2024-01-11 15:31 ` Stefano Brivio
0 siblings, 2 replies; 3+ messages in thread
From: Jon Maloy @ 2024-01-11 13:19 UTC
To: passt-dev, sbrivio, lvivier, dgibson, jmaloy
When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.
In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.
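A minimal userspace sketch of the calling convention described above, assuming a kernel that carries this RFC patch. The helper names peek_off_iov_init() and tcp_peek_at() are hypothetical and only serve the illustration. On a kernel without the patch, the NULL base in the first entry is not treated as an offset marker, so the call is not expected to skip any data and will typically fail instead.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: describe "skip 'offset' bytes, then fill 'buf'".
 * Entry 0 has a NULL base, so (with this patch applied) its iov_len is
 * interpreted as the peek offset rather than as a buffer size.
 */
void peek_off_iov_init(struct iovec iov[2], size_t offset,
		       void *buf, size_t len)
{
	iov[0].iov_base = NULL;
	iov[0].iov_len = offset;
	iov[1].iov_base = buf;
	iov[1].iov_len = len;
}

/* Hypothetical wrapper: peek up to 'len' bytes starting 'offset' bytes
 * into the receive queue of TCP socket 's', without consuming anything.
 * Returns whatever recvmsg() returns.
 */
ssize_t tcp_peek_at(int s, size_t offset, void *buf, size_t len)
{
	struct iovec iov[2];
	struct msghdr mh = { 0 };

	peek_off_iov_init(iov, offset, buf, len);
	mh.msg_iov = iov;
	mh.msg_iovlen = 2;	/* the patch rejects fewer than two entries */

	return recvmsg(s, &mh, MSG_PEEK);
}
```

Since the offset travels in an existing iovec field, the msghdr layout and the flags passed to recvmsg() are the same as for an ordinary MSG_PEEK call.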
In the iperf3 log examples shown below, we can observe a throughput
improvement of ~15 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.
pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between a Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).
Received TCP data pending for the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from the socket, skipping data that was already sent.
At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.
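For comparison, the dummy-buffer workaround mentioned above can be sketched roughly as follows. This is not pasta's actual code: the function name peek_at_with_dummy() and the per-call malloc() are assumptions made to keep the example self-contained. It shows where the extra copy comes from: the bytes before the offset are still copied to userspace, only to be discarded.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper showing the workaround without kernel offset support:
 * the first 'offset' bytes are peeked into a scratch buffer and thrown away,
 * so the kernel still copies them to userspace on every call.
 * Returns the number of bytes stored in 'buf', or -1 on error.
 */
ssize_t peek_at_with_dummy(int s, size_t offset, void *buf, size_t len)
{
	struct iovec iov[2];
	struct msghdr mh = { 0 };
	char *discard;
	ssize_t n;

	discard = malloc(offset ? offset : 1);
	if (!discard)
		return -1;

	iov[0].iov_base = discard;	/* data that was already sent */
	iov[0].iov_len = offset;
	iov[1].iov_base = buf;		/* data the caller actually wants */
	iov[1].iov_len = len;
	mh.msg_iov = iov;
	mh.msg_iovlen = 2;

	n = recvmsg(s, &mh, MSG_PEEK);
	free(discard);

	if (n < 0)
		return -1;

	/* Fewer than 'offset' bytes queued: nothing new for the caller */
	return (size_t)n > offset ? n - (ssize_t)offset : 0;
}
```

This variant works on any stream socket today, so it can also serve as a fallback when offset support is not available.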
passt and pasta are supported in KubeVirt and libvirt/qemu.
jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
MSG_PEEK with offset not supported by kernel.
jmaloy@lubu:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
[ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
[ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
[ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
[ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
[ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
[ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
[ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
[ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
[ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
[ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@lubu:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@lubu:~/passt$
jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
MSG_PEEK with offset supported by kernel.
jmaloy@lubu:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 40854
[ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.22 GBytes 10.5 Gbits/sec
[ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
[ 5] 2.00-3.00 sec 1.22 GBytes 10.5 Gbits/sec
[ 5] 3.00-4.00 sec 1.11 GBytes 9.56 Gbits/sec
[ 5] 4.00-5.00 sec 1.20 GBytes 10.3 Gbits/sec
[ 5] 5.00-6.00 sec 1.14 GBytes 9.80 Gbits/sec
[ 5] 6.00-7.00 sec 1.17 GBytes 10.0 Gbits/sec
[ 5] 7.00-8.00 sec 1.12 GBytes 9.61 Gbits/sec
[ 5] 8.00-9.00 sec 1.13 GBytes 9.74 Gbits/sec
[ 5] 9.00-10.00 sec 1.26 GBytes 10.8 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 11.8 GBytes 10.1 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@lubu:~/passt$
The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.
Without offset support:
----------------------
jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
With offset support:
----------------------
jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
27.24% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
v3: Made changes suggested by Stefano Brivio:
- Added perf result to commit log
- Separated parameter sanity tests from code logics
v4: - Simplified sanity test further.
'iov_iter.count' is calculated as the sum of the segment sizes in
___sys_recvmsg()->recvmsg_copy_msghdr()->copy_msghdr_from_user()->__import_iovec()->iov_iter_init()
'len' is the same as iov_iter.count, as returned in
sock_recvmsg_nosec()->msg_data_left()->iov_iter_count()
Hence, iov[0].iov_len cannot be larger than any of those, and no additional
testing is necessary.
- Improved description of passt/pasta in commit log
- Some cosmetic changes to the iperf3/perf output
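
The reasoning in the v4 note can be illustrated with a small userspace model. This is a sketch only: struct iter_model and both functions are hypothetical stand-ins for the iov_iter fields the patch touches, not kernel code, and the model assumes at least one segment and no overflow of the summed lengths.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/uio.h>

/* Simplified, hypothetical stand-in for the iov_iter fields used here */
struct iter_model {
	const struct iovec *iov;
	size_t nr_segs;
	size_t count;		/* sum of iov_len over all segments */
};

/* Models the initialisation described above: count = sum of segment sizes */
void iter_model_init(struct iter_model *it, const struct iovec *iov,
		     size_t nr_segs)
{
	size_t i;

	it->iov = iov;
	it->nr_segs = nr_segs;
	it->count = 0;
	for (i = 0; i < nr_segs; i++)
		it->count += iov[i].iov_len;
}

/* Models the added hunk. Assumes nr_segs >= 1.
 * Returns 0 and stores the offset in *offset, or -EINVAL.
 */
int iter_model_skip_first(struct iter_model *it, size_t *offset)
{
	*offset = 0;
	if (it->iov[0].iov_base)
		return 0;	/* ordinary peek, nothing to skip */
	if (it->nr_segs < 2)
		return -EINVAL;

	*offset = it->iov[0].iov_len;
	it->iov = &it->iov[1];
	it->nr_segs -= 1;
	/* Cannot underflow: the offset is one of the terms summed in count */
	it->count -= *offset;
	return 0;
}
```

With entries of 10 and 32 bytes, count starts at 42 and ends at 32 after the skip, which is why no separate bound check on iov[0].iov_len appears in the model.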
---
net/ipv4/tcp.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ff6838ca2e58..50dc997b82f9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2353,6 +2353,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			size_t peek_offset;
+
+			if (msg->msg_iter.nr_segs < 2) {
+				err = -EINVAL;
+				goto out;
+			}
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			msg->msg_iter.nr_segs -= 1;
+			msg->msg_iter.count -= peek_offset;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}

 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
--
2.42.0
* Re: [RFC net-next v4] tcp: add support for read with offset when using MSG_PEEK
From: Jon Maloy @ 2024-01-11 13:28 UTC
To: passt-dev, sbrivio, lvivier, dgibson
On 2024-01-11 08:19, Jon Maloy wrote:
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~15 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset not supported by kernel.
>
> jmaloy@lubu:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
> [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
> [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
> [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
> [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
> [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
> [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
> [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
> [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
> [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@lubu:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@lubu:~/passt$
>
> jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset supported by kernel.
>
> jmaloy@lubu:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 40854
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.22 GBytes 10.5 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.22 GBytes 10.5 Gbits/sec
> [ 5] 3.00-4.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 4.00-5.00 sec 1.20 GBytes 10.3 Gbits/sec
> [ 5] 5.00-6.00 sec 1.14 GBytes 9.80 Gbits/sec
> [ 5] 6.00-7.00 sec 1.17 GBytes 10.0 Gbits/sec
> [ 5] 7.00-8.00 sec 1.12 GBytes 9.61 Gbits/sec
> [ 5] 8.00-9.00 sec 1.13 GBytes 9.74 Gbits/sec
> [ 5] 9.00-10.00 sec 1.26 GBytes 10.8 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 11.8 GBytes 10.1 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@lubu:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 27.24% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
>
> ---
> v3: Made changes suggested by Stefano Brivio:
> - Added perf result to commit log
> - Separated parameter sanity tests from code logics
>
> v4: - Simplified sanity test further.
> 'iov_iter.count' is calculated as the sum of the segment sizes in
> ___sys_recvmsg()->recvmsg_copy_msghdr()->copy_msghdr_from_user()->__import_iovec()->iov_iter_init()
> 'len' is the same as iov_iter.count, as returned in
> sock_recvmsg_nosec()->msg_data_left()->iov_iter_count()
> Hence, iov[0].iov_len cannot be larger than any of those, and no additional
> testing is necessary.
> - Improved description of passt/pasta in commit log
> - Some cosmetic changes to the iperf3/perf output
Hi Stefano,
I think we are very close now. If I can get your ack on this today, I will
send it to net-next, and then we'll see...
///jon
> ---
> net/ipv4/tcp.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ff6838ca2e58..50dc997b82f9 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2353,6 +2353,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> if (flags & MSG_PEEK) {
> peek_seq = tp->copied_seq;
> seq = &peek_seq;
> + if (!msg->msg_iter.__iov[0].iov_base) {
> + size_t peek_offset;
> +
> + if (msg->msg_iter.nr_segs < 2) {
> + err = -EINVAL;
> + goto out;
> + }
> + peek_offset = msg->msg_iter.__iov[0].iov_len;
> + msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> + msg->msg_iter.nr_segs -= 1;
> + msg->msg_iter.count -= peek_offset;
> + len -= peek_offset;
> + *seq += peek_offset;
> + }
> }
>
> target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
* Re: [RFC net-next v4] tcp: add support for read with offset when using MSG_PEEK
From: Stefano Brivio @ 2024-01-11 15:31 UTC
To: Jon Maloy; +Cc: passt-dev, lvivier, dgibson
On Thu, 11 Jan 2024 08:19:17 -0500
Jon Maloy <jmaloy@redhat.com> wrote:
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~15 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset not supported by kernel.
>
> jmaloy@lubu:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
> [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
> [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
> [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
> [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
> [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
> [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
> [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
> [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
> [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@lubu:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@lubu:~/passt$
>
> jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset supported by kernel.
>
> jmaloy@lubu:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 40854
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.22 GBytes 10.5 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.22 GBytes 10.5 Gbits/sec
> [ 5] 3.00-4.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 4.00-5.00 sec 1.20 GBytes 10.3 Gbits/sec
> [ 5] 5.00-6.00 sec 1.14 GBytes 9.80 Gbits/sec
> [ 5] 6.00-7.00 sec 1.17 GBytes 10.0 Gbits/sec
> [ 5] 7.00-8.00 sec 1.12 GBytes 9.61 Gbits/sec
> [ 5] 8.00-9.00 sec 1.13 GBytes 9.74 Gbits/sec
> [ 5] 9.00-10.00 sec 1.26 GBytes 10.8 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 11.8 GBytes 10.1 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@lubu:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 27.24% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> v3: Made changes suggested by Stefano Brivio:
> - Added perf result to commit log
> - Separated parameter sanity tests from code logics
>
> v4: - Simplified sanity test further.
> 'iov_iter.count' is calculated as the sum of the segment sizes in
> ___sys_recvmsg()->recvmsg_copy_msghdr()->copy_msghdr_from_user()->__import_iovec()->iov_iter_init()
> 'len' is the same as iov_iter.count, as returned in
> sock_recvmsg_nosec()->msg_data_left()->iov_iter_count()
> Hence, iov[0].iov_len cannot be larger than any of those, and no additional
> testing is necessary.
> - Improved description of passt/pasta in commit log
> - Some cosmetic changes to the iperf3/perf output
> ---
> net/ipv4/tcp.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ff6838ca2e58..50dc997b82f9 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2353,6 +2353,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> if (flags & MSG_PEEK) {
> peek_seq = tp->copied_seq;
> seq = &peek_seq;
> + if (!msg->msg_iter.__iov[0].iov_base) {
> + size_t peek_offset;
> +
> + if (msg->msg_iter.nr_segs < 2) {
> + err = -EINVAL;
> + goto out;
> + }
> + peek_offset = msg->msg_iter.__iov[0].iov_len;
> + msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> + msg->msg_iter.nr_segs -= 1;
> + msg->msg_iter.count -= peek_offset;
> + len -= peek_offset;
> + *seq += peek_offset;
> + }
> }
>
> target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
--
Stefano