public inbox for passt-dev@passt.top
* [RFC net-next v2] tcp: add support for read with offset when using MSG_PEEK
@ 2024-01-20 16:52 jmaloy
  0 siblings, 0 replies; 4+ messages in thread
From: jmaloy @ 2024-01-20 16:52 UTC (permalink / raw)
  To: netdev, davem; +Cc: kuba, passt-dev, jmaloy, sbrivio, lvivier, dgibson

From: Jon Maloy <jmaloy@redhat.com>

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.
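
From user space, the proposed convention could be used roughly as in the
sketch below. The helper names peek_offset_prepare() and peek_at_offset()
are hypothetical and appear neither in this patch nor in passt. The NULL
iov_base only means "skip iov_len bytes" on a kernel carrying this change;
on other kernels the call should be expected to fail or behave differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: fill in a two-entry iovec following the convention
 * proposed here. Entry 0 has a NULL base and carries the offset in
 * iov_len, entry 1 describes the buffer that should receive the data.
 */
static void peek_offset_prepare(struct msghdr *msg, struct iovec iov[2],
				size_t offset, void *buf, size_t len)
{
	assert(msg && iov);

	memset(msg, 0, sizeof(*msg));

	iov[0].iov_base = NULL;		/* "skip this entry" marker */
	iov[0].iov_len = offset;	/* number of bytes to skip */
	iov[1].iov_base = buf;
	iov[1].iov_len = len;

	msg->msg_iov = iov;
	msg->msg_iovlen = 2;
}

/*
 * Hypothetical helper: peek at up to 'len' bytes starting 'offset' bytes
 * into the pending data. Returns whatever recvmsg() returns; this is only
 * meaningful on a kernel with this patch applied (assumption).
 */
static ssize_t peek_at_offset(int fd, size_t offset, void *buf, size_t len)
{
	struct iovec iov[2];
	struct msghdr msg;

	peek_offset_prepare(&msg, iov, offset, buf, len);

	return recvmsg(fd, &msg, MSG_PEEK);
}
```

A caller would use peek_at_offset(fd, already_sent, buf, sizeof(buf)) in
place of a plain MSG_PEEK read, where already_sent is the number of bytes
it has already forwarded but not yet dequeued.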

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~15 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.

pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between a Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).

Received TCP data pending delivery to the container/guest is kept in
kernel buffers until acknowledged, so the tool routinely needs to fetch
new data from the socket, skipping data that was already sent.

At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.
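
For comparison, the dummy-buffer approach can be sketched as follows. This
is an illustration only, not passt's actual code: the helper name
peek_skip_with_dummy() is made up, and the per-call malloc() is just to
keep the example short. It works on any stream socket with an unpatched
kernel, at the cost of copying the skipped bytes into the dummy buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: peek at the data that follows the first 'offset'
 * pending bytes. The skipped bytes are still copied to user space, into
 * a scratch buffer that is thrown away afterwards.
 *
 * Returns the number of bytes stored in 'buf', 0 if fewer than 'offset'
 * bytes (or exactly 'offset' bytes) were pending, -1 on error.
 */
static ssize_t peek_skip_with_dummy(int fd, size_t offset,
				    void *buf, size_t len)
{
	struct iovec iov[2];
	struct msghdr msg;
	void *dummy;
	ssize_t n;

	assert(buf != NULL);

	dummy = malloc(offset ? offset : 1);
	if (!dummy)
		return -1;

	memset(&msg, 0, sizeof(msg));
	iov[0].iov_base = dummy;	/* receives the bytes we want to skip */
	iov[0].iov_len = offset;
	iov[1].iov_base = buf;		/* receives the bytes we care about */
	iov[1].iov_len = len;
	msg.msg_iov = iov;
	msg.msg_iovlen = 2;

	n = recvmsg(fd, &msg, MSG_PEEK);
	free(dummy);

	if (n < 0)
		return -1;
	if ((size_t)n <= offset)
		return 0;

	return n - (ssize_t)offset;
}
```

With offset support in the kernel, the first iovec entry no longer needs a
backing buffer and the copy of the skipped bytes goes away.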

passt and pasta are supported in KubeVirt and libvirt/qemu.

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
MSG_PEEK with offset not supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
MSG_PEEK with offset supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 40854
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.22 GBytes  10.5 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.22 GBytes  10.5 Gbits/sec
[  5]   3.00-4.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   4.00-5.00   sec  1.20 GBytes  10.3 Gbits/sec
[  5]   5.00-6.00   sec  1.14 GBytes  9.80 Gbits/sec
[  5]   6.00-7.00   sec  1.17 GBytes  10.0 Gbits/sec
[  5]   7.00-8.00   sec  1.12 GBytes  9.61 Gbits/sec
[  5]   8.00-9.00   sec  1.13 GBytes  9.74 Gbits/sec
[  5]   9.00-10.00  sec  1.26 GBytes  10.8 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  11.8 GBytes  10.1 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
    46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
   27.24%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
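
As a quick cross-check of the figures quoted above, the relative changes
can be computed with a few lines of C. The function name
relative_change_pct() is made up for this illustration; the inputs are the
summary values from the iperf3 and perf outputs in this message.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent of 'before'. */
static double relative_change_pct(double before, double after)
{
	assert(before > 0.0);

	return (after - before) / before * 100.0;
}
```

relative_change_pct(8.78, 10.1) gives the throughput gain (about 15 %),
and relative_change_pct(46.32, 27.24) gives the change in the share of
samples under ____sys_recvmsg() (about -41 %).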

Signed-off-by: Jon Maloy <jmaloy@redhat.com>

---
v2: Put test of msg->msg_iter.nr_segs before test on msg->msg_iter.__iov,
    since the latter may be uninitialized when other receive functions
    are used. Reported by Martin Zaharinov.
---
 net/ipv4/tcp.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fce5668a6a3d..e8fdf3617377 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2351,6 +2351,16 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (msg->msg_iter.nr_segs > 1 && !msg->msg_iter.__iov[0].iov_base) {
+			size_t peek_offset;
+
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			msg->msg_iter.nr_segs -= 1;
+			msg->msg_iter.count -= peek_offset;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-- 
2.42.0



* Re: [RFC net-next v2] tcp: add support for read with offset when using MSG_PEEK
  2023-12-05 23:20 Jon Maloy
  2023-12-06 16:48 ` Jon Maloy
@ 2023-12-06 18:02 ` Stefano Brivio
  1 sibling, 0 replies; 4+ messages in thread
From: Stefano Brivio @ 2023-12-06 18:02 UTC (permalink / raw)
  To: Jon Maloy; +Cc: passt-dev, lvivier, dgibson

On Tue,  5 Dec 2023 18:20:28 -0500
Jon Maloy <jmaloy@redhat.com> wrote:

> When reading received messages with MSG_PEEK, we sometimes have to read
> the leading bytes of the stream several times, only to reach the bytes
> we really want. This is clearly non-optimal.

I'm not sure there are many other usage patterns like this outside
passt -- I would simply state that if we want to peek with an offset,
we can't. And perhaps explain why passt(1) and pasta(1) need to do that.

> What we would want is something similar to pread/preadv(), but working
> even for tcp sockets. At the same time, we don't want to add any new
> arguments to the recv/recvmsg() calls.
> 
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
> 
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~20 % in the direction host->namespace when using the
> protocol splicer 'passt'. This is a consistent result.

I'm not sure how widely known it is, I would add a link
(https://passt.top).

> $ ./passt/passt/pasta --config-net  -f
> MSG_PEEK with offset not supported.
> [root@fedora37 ~]# perf record iperf3 -s

Here you're profiling iperf3 (not pasta), but not showing the results
of the profiling. Indeed, if you have a consistent throughput
improvement, that's also great (and great to show), but there's no need
to profile iperf3 -- I don't expect any difference from its point of
view.

> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 60344
> [  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
> [ ID] Interval           Transfer     Bitrate
> [...]
> [  6]  13.00-14.00  sec  2.54 GBytes  21.8 Gbits/sec
> [  6]  14.00-15.00  sec  2.52 GBytes  21.7 Gbits/sec
> [  6]  15.00-16.00  sec  2.50 GBytes  21.5 Gbits/sec
> [  6]  16.00-17.00  sec  2.49 GBytes  21.4 Gbits/sec
> [  6]  17.00-18.00  sec  2.51 GBytes  21.6 Gbits/sec
> [  6]  18.00-19.00  sec  2.48 GBytes  21.3 Gbits/sec
> [  6]  19.00-20.00  sec  2.49 GBytes  21.4 Gbits/sec
> [  6]  20.00-20.04  sec  87.4 MBytes  19.2 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  6]   0.00-20.04  sec  48.9 GBytes  21.0 Gbits/sec receiver
> -----------------------------------------------------------
> 
> [jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net  -f
> MSG_PEEK with offset supported.
> [root@fedora37 ~]# perf record iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 46362
> [  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
> [ ID] Interval           Transfer     Bitrate
> [...]
> [  6]  12.00-13.00  sec  3.18 GBytes  27.3 Gbits/sec
> [  6]  13.00-14.00  sec  3.17 GBytes  27.3 Gbits/sec
> [  6]  14.00-15.00  sec  3.13 GBytes  26.9 Gbits/sec
> [  6]  15.00-16.00  sec  3.17 GBytes  27.3 Gbits/sec
> [  6]  16.00-17.00  sec  3.17 GBytes  27.2 Gbits/sec
> [  6]  17.00-18.00  sec  3.14 GBytes  27.0 Gbits/sec
> [  6]  18.00-19.00  sec  3.17 GBytes  27.2 Gbits/sec
> [  6]  19.00-20.00  sec  3.12 GBytes  26.8 Gbits/sec
> [  6]  20.00-20.04  sec   119 MBytes  25.5 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  6]   0.00-20.04  sec  59.4 GBytes  25.4 Gbits/sec receiver
> -----------------------------------------------------------

...that is, what I personally find more conclusive is that the overhead
spent in ____sys_recvmsg(), or tcp_recvmsg_locked(), decreases
dramatically with this:

$ perf record -g ./passt -f -t 5201

[...]

$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data.old_peek | head -1
    57.16%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data.new_peek | head -1
    38.66%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

Those command lines are a bit convoluted, though. I guess running pasta
or passt with 'perf stat' and selecting the 'cycles' event for the
interesting symbols might be more obvious.

> Passt is used to support VMs in containers, such as KubeVirt, and
> is also generally supported in libvirt/QEMU since release 9.2 / 7.2.

Not just VMs in containers... but yes, that was the original use case.
I find it a bit confusing that you're using pasta(1) in the example but
mentioning passt(1) (perhaps it's my fault ;)) -- maybe mention that
pasta(1) is used with containers (Podman?) instead?

> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>  net/ipv4/tcp.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53bcc17c91e4..e9d3b5bf2f66 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  			      int *cmsg_flags)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> +	size_t peek_offset;

This could be moved where it's needed.

>  	int copied = 0;
>  	u32 peek_seq;
>  	u32 *seq;
> @@ -2353,6 +2354,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (flags & MSG_PEEK) {
>  		peek_seq = tp->copied_seq;
>  		seq = &peek_seq;
> +		if (!msg->msg_iter.__iov[0].iov_base) {
> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> +			if (msg->msg_iter.nr_segs <= 1)
> +				goto out;

'err' shouldn't be ENOTCONN here (that's why I got that cryptic error
when I messed up recvmsg() while reviewing the other patch). EINVAL
would make more sense. I haven't checked the other cases.

> +			msg->msg_iter.nr_segs -= 1;
> +			if (msg->msg_iter.count <= peek_offset)
> +				goto out;

I find it a bit difficult to follow these checks interleaved with
assignments. That is, I've been wondering for a while why you would
want to check for msg_iter.count <= peek_offset only after decreasing
the number of segments, only to find out that there's actually no
relationship between the two things.

Maybe newlines between different parts of the overall logic would help.

> +			msg->msg_iter.count -= peek_offset;
> +			if (len <= peek_offset)
> +				goto out;
> +			len -= peek_offset;
> +			*seq += peek_offset;
> +		}
>  	}
>  
>  	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
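
The two review comments on this hunk (the error code, and checks
interleaved with assignments) can be illustrated with a small user-space
model. This is only a sketch: struct peek_state and peek_apply_offset()
are hypothetical names, the struct merely mirrors the fields touched by
the hunk, and returning -EINVAL for all three checks is an assumption,
since only the first case was looked at in the review. Advancing the
iovec pointer is left out of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the state the hunk modifies (hypothetical). */
struct peek_state {
	size_t nr_segs;		/* models msg->msg_iter.nr_segs */
	size_t count;		/* models msg->msg_iter.count */
	size_t len;		/* models the 'len' argument */
	uint32_t seq;		/* models *seq */
};

/*
 * Validate everything first, then update everything. On failure the
 * state is left untouched and -EINVAL is returned (assumed error code).
 */
static int peek_apply_offset(struct peek_state *s, size_t peek_offset)
{
	assert(s != NULL);

	/* Checks: need a second segment and room beyond the offset */
	if (s->nr_segs <= 1)
		return -EINVAL;

	if (s->count <= peek_offset || s->len <= peek_offset)
		return -EINVAL;

	/* Updates: drop the first segment and shift by the offset */
	s->nr_segs -= 1;
	s->count -= peek_offset;
	s->len -= peek_offset;
	s->seq += (uint32_t)peek_offset;

	return 0;
}
```

Grouping the checks ahead of the updates makes it visible that they are
independent of each other, and guarantees that no field is modified when
the request is rejected.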

-- 
Stefano



* Re: [RFC net-next v2] tcp: add support for read with offset when using MSG_PEEK
  2023-12-05 23:20 Jon Maloy
@ 2023-12-06 16:48 ` Jon Maloy
  2023-12-06 18:02 ` Stefano Brivio
  1 sibling, 0 replies; 4+ messages in thread
From: Jon Maloy @ 2023-12-06 16:48 UTC (permalink / raw)
  To: passt-dev, sbrivio, lvivier, dgibson

Note that I only sent this one to passt-dev, not netdev.
I would appreciate feedback and possible ack/reviewed-by as soon as
possible so I can send it to netdev.

///jon

On 2023-12-05 18:20, Jon Maloy wrote:
> When reading received messages with MSG_PEEK, we sometimes have to read
> the leading bytes of the stream several times, only to reach the bytes
> we really want. This is clearly non-optimal.
>
> What we would want is something similar to pread/preadv(), but working
> even for tcp sockets. At the same time, we don't want to add any new
> arguments to the recv/recvmsg() calls.
>
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~20 % in the direction host->namespace when using the
> protocol splicer 'passt'. This is a consistent result.
>
> $ ./passt/passt/pasta --config-net  -f
> MSG_PEEK with offset not supported.
> [root@fedora37 ~]# perf record iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 60344
> [  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
> [ ID] Interval           Transfer     Bitrate
> [...]
> [  6]  13.00-14.00  sec  2.54 GBytes  21.8 Gbits/sec
> [  6]  14.00-15.00  sec  2.52 GBytes  21.7 Gbits/sec
> [  6]  15.00-16.00  sec  2.50 GBytes  21.5 Gbits/sec
> [  6]  16.00-17.00  sec  2.49 GBytes  21.4 Gbits/sec
> [  6]  17.00-18.00  sec  2.51 GBytes  21.6 Gbits/sec
> [  6]  18.00-19.00  sec  2.48 GBytes  21.3 Gbits/sec
> [  6]  19.00-20.00  sec  2.49 GBytes  21.4 Gbits/sec
> [  6]  20.00-20.04  sec  87.4 MBytes  19.2 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  6]   0.00-20.04  sec  48.9 GBytes  21.0 Gbits/sec receiver
> -----------------------------------------------------------
>
> [jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net  -f
> MSG_PEEK with offset supported.
> [root@fedora37 ~]# perf record iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 46362
> [  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
> [ ID] Interval           Transfer     Bitrate
> [...]
> [  6]  12.00-13.00  sec  3.18 GBytes  27.3 Gbits/sec
> [  6]  13.00-14.00  sec  3.17 GBytes  27.3 Gbits/sec
> [  6]  14.00-15.00  sec  3.13 GBytes  26.9 Gbits/sec
> [  6]  15.00-16.00  sec  3.17 GBytes  27.3 Gbits/sec
> [  6]  16.00-17.00  sec  3.17 GBytes  27.2 Gbits/sec
> [  6]  17.00-18.00  sec  3.14 GBytes  27.0 Gbits/sec
> [  6]  18.00-19.00  sec  3.17 GBytes  27.2 Gbits/sec
> [  6]  19.00-20.00  sec  3.12 GBytes  26.8 Gbits/sec
> [  6]  20.00-20.04  sec   119 MBytes  25.5 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  6]   0.00-20.04  sec  59.4 GBytes  25.4 Gbits/sec receiver
> -----------------------------------------------------------
>
> Passt is used to support VMs in containers, such as KubeVirt, and
> is also generally supported in libvirt/QEMU since release 9.2 / 7.2.
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>   net/ipv4/tcp.c | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 53bcc17c91e4..e9d3b5bf2f66 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>   			      int *cmsg_flags)
>   {
>   	struct tcp_sock *tp = tcp_sk(sk);
> +	size_t peek_offset;
>   	int copied = 0;
>   	u32 peek_seq;
>   	u32 *seq;
> @@ -2353,6 +2354,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>   	if (flags & MSG_PEEK) {
>   		peek_seq = tp->copied_seq;
>   		seq = &peek_seq;
> +		if (!msg->msg_iter.__iov[0].iov_base) {
> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> +			if (msg->msg_iter.nr_segs <= 1)
> +				goto out;
> +			msg->msg_iter.nr_segs -= 1;
> +			if (msg->msg_iter.count <= peek_offset)
> +				goto out;
> +			msg->msg_iter.count -= peek_offset;
> +			if (len <= peek_offset)
> +				goto out;
> +			len -= peek_offset;
> +			*seq += peek_offset;
> +		}
>   	}
>   
>   	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);



* [RFC net-next v2] tcp: add support for read with offset when using MSG_PEEK
@ 2023-12-05 23:20 Jon Maloy
  2023-12-06 16:48 ` Jon Maloy
  2023-12-06 18:02 ` Stefano Brivio
  0 siblings, 2 replies; 4+ messages in thread
From: Jon Maloy @ 2023-12-05 23:20 UTC (permalink / raw)
  To: passt-dev, sbrivio, lvivier, dgibson, jmaloy

When reading received messages with MSG_PEEK, we sometimes have to read
the leading bytes of the stream several times, only to reach the bytes
we really want. This is clearly non-optimal.

What we would want is something similar to pread/preadv(), but working
even for tcp sockets. At the same time, we don't want to add any new
arguments to the recv/recvmsg() calls.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~20 % in the direction host->namespace when using the
protocol splicer 'passt'. This is a consistent result.

$ ./passt/passt/pasta --config-net  -f
MSG_PEEK with offset not supported.
[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 60344
[  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
[ ID] Interval           Transfer     Bitrate
[...]
[  6]  13.00-14.00  sec  2.54 GBytes  21.8 Gbits/sec
[  6]  14.00-15.00  sec  2.52 GBytes  21.7 Gbits/sec
[  6]  15.00-16.00  sec  2.50 GBytes  21.5 Gbits/sec
[  6]  16.00-17.00  sec  2.49 GBytes  21.4 Gbits/sec
[  6]  17.00-18.00  sec  2.51 GBytes  21.6 Gbits/sec
[  6]  18.00-19.00  sec  2.48 GBytes  21.3 Gbits/sec
[  6]  19.00-20.00  sec  2.49 GBytes  21.4 Gbits/sec
[  6]  20.00-20.04  sec  87.4 MBytes  19.2 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  6]   0.00-20.04  sec  48.9 GBytes  21.0 Gbits/sec receiver
-----------------------------------------------------------

[jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net  -f
MSG_PEEK with offset supported.
[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 46362
[  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
[ ID] Interval           Transfer     Bitrate
[...]
[  6]  12.00-13.00  sec  3.18 GBytes  27.3 Gbits/sec
[  6]  13.00-14.00  sec  3.17 GBytes  27.3 Gbits/sec
[  6]  14.00-15.00  sec  3.13 GBytes  26.9 Gbits/sec
[  6]  15.00-16.00  sec  3.17 GBytes  27.3 Gbits/sec
[  6]  16.00-17.00  sec  3.17 GBytes  27.2 Gbits/sec
[  6]  17.00-18.00  sec  3.14 GBytes  27.0 Gbits/sec
[  6]  18.00-19.00  sec  3.17 GBytes  27.2 Gbits/sec
[  6]  19.00-20.00  sec  3.12 GBytes  26.8 Gbits/sec
[  6]  20.00-20.04  sec   119 MBytes  25.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  6]   0.00-20.04  sec  59.4 GBytes  25.4 Gbits/sec receiver
-----------------------------------------------------------

Passt is used to support VMs in containers, such as KubeVirt, and
is also generally supported in libvirt/QEMU since release 9.2 / 7.2.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
 net/ipv4/tcp.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53bcc17c91e4..e9d3b5bf2f66 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	size_t peek_offset;
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
@@ -2353,6 +2354,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			if (msg->msg_iter.nr_segs <= 1)
+				goto out;
+			msg->msg_iter.nr_segs -= 1;
+			if (msg->msg_iter.count <= peek_offset)
+				goto out;
+			msg->msg_iter.count -= peek_offset;
+			if (len <= peek_offset)
+				goto out;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-- 
2.39.0




Code repositories for project(s) associated with this public inbox

	https://passt.top/passt
