public inbox for passt-dev@passt.top
* [RFC net-next] tcp: add support for read with offset when using MSG_PEEK
@ 2024-01-11 22:22 jmaloy
  0 siblings, 0 replies; 6+ messages in thread
From: jmaloy @ 2024-01-11 22:22 UTC (permalink / raw)
  To: netdev, davem; +Cc: kuba, passt-dev, jmaloy, sbrivio, lvivier, dgibson

From: Jon Maloy <jmaloy@redhat.com>

When reading received messages with MSG_PEEK, we sometimes have to read
the leading bytes of the stream several times, only to reach the bytes
we really want. This is clearly non-optimal.

What we would want is something similar to pread/preadv(), but working
even for tcp sockets. At the same time, we don't want to add any new
arguments to the recv/recvmsg() calls.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.
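
As a caller-side illustration of the convention described above, here is a
minimal sketch. It assumes a kernel carrying this RFC; on kernels without it,
the same call is reported later in this thread to fail with EFAULT once data
is received. The helper name peek_iov_init() is hypothetical and not part of
passt or the kernel.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: fill a two-entry iovec so that, on a kernel with this
 * RFC applied, recvmsg(fd, mh, MSG_PEEK) would skip 'offset' bytes and copy
 * up to 'len' bytes into 'buf'. Entry 0 has a NULL base and its length
 * carries the offset.
 */
static void peek_iov_init(struct iovec iov[2], struct msghdr *mh,
			  size_t offset, void *buf, size_t len)
{
	iov[0].iov_base = NULL;		/* marks the entry as "skip" */
	iov[0].iov_len = offset;	/* number of leading bytes to skip */
	iov[1].iov_base = buf;		/* where the wanted bytes should land */
	iov[1].iov_len = len;

	memset(mh, 0, sizeof(*mh));
	mh->msg_iov = iov;
	mh->msg_iovlen = 2;
}
```

The caller would then pass the prepared header to recvmsg() with MSG_PEEK set.
Note that the proposed kernel code treats a vector with fewer than two entries
as a case it cannot serve, so the second entry is mandatory.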

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~20 % in the direction host->namespace when using the
protocol splicer 'passt'. This is a consistent result.

$ ./passt/passt/pasta --config-net  -f
MSG_PEEK with offset not supported.
[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 60344
[  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
[ ID] Interval           Transfer     Bitrate
[...]
[  6]  13.00-14.00  sec  2.54 GBytes  21.8 Gbits/sec
[  6]  14.00-15.00  sec  2.52 GBytes  21.7 Gbits/sec
[  6]  15.00-16.00  sec  2.50 GBytes  21.5 Gbits/sec
[  6]  16.00-17.00  sec  2.49 GBytes  21.4 Gbits/sec
[  6]  17.00-18.00  sec  2.51 GBytes  21.6 Gbits/sec
[  6]  18.00-19.00  sec  2.48 GBytes  21.3 Gbits/sec
[  6]  19.00-20.00  sec  2.49 GBytes  21.4 Gbits/sec
[  6]  20.00-20.04  sec  87.4 MBytes  19.2 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  6]   0.00-20.04  sec  48.9 GBytes  21.0 Gbits/sec receiver
-----------------------------------------------------------

[jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net  -f
MSG_PEEK with offset supported.
[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 46362
[  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
[ ID] Interval           Transfer     Bitrate
[...]
[  6]  12.00-13.00  sec  3.18 GBytes  27.3 Gbits/sec
[  6]  13.00-14.00  sec  3.17 GBytes  27.3 Gbits/sec
[  6]  14.00-15.00  sec  3.13 GBytes  26.9 Gbits/sec
[  6]  15.00-16.00  sec  3.17 GBytes  27.3 Gbits/sec
[  6]  16.00-17.00  sec  3.17 GBytes  27.2 Gbits/sec
[  6]  17.00-18.00  sec  3.14 GBytes  27.0 Gbits/sec
[  6]  18.00-19.00  sec  3.17 GBytes  27.2 Gbits/sec
[  6]  19.00-20.00  sec  3.12 GBytes  26.8 Gbits/sec
[  6]  20.00-20.04  sec   119 MBytes  25.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  6]   0.00-20.04  sec  59.4 GBytes  25.4 Gbits/sec receiver
-----------------------------------------------------------
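
The ~20 % figure can be cross-checked against the receiver summary lines of
the two runs above (21.0 and 25.4 Gbits/sec). A small sketch of that
arithmetic follows; the function name pct_gain() is made up for the example.

```c
/* Relative change from 'before' to 'after', in percent. */
static double pct_gain(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the two summary bitrates, pct_gain(21.0, 25.4) evaluates to roughly 21,
which is in line with the ~20 % stated in the commit message.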

passt is used to provide networking for VMs running in containers, for
example under KubeVirt, and is also generally supported by libvirt and
QEMU since releases 9.2 and 7.2, respectively.

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jon Paul Maloy <jmaloy@redhat.com>
---
 net/ipv4/tcp.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53bcc17c91e4..e9d3b5bf2f66 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	size_t peek_offset;
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
@@ -2353,6 +2354,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			if (msg->msg_iter.nr_segs <= 1)
+				goto out;
+			msg->msg_iter.nr_segs -= 1;
+			if (msg->msg_iter.count <= peek_offset)
+				goto out;
+			msg->msg_iter.count -= peek_offset;
+			if (len <= peek_offset)
+				goto out;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-- 
2.39.0



* Re: [RFC net-next] tcp: add support for read with offset when using MSG_PEEK
  2024-01-21 22:16     ` Stefano Brivio
@ 2024-01-22 16:22       ` Jon Maloy
  0 siblings, 0 replies; 6+ messages in thread
From: Jon Maloy @ 2024-01-22 16:22 UTC (permalink / raw)
  To: Stefano Brivio, Paolo Abeni
  Cc: netdev, davem, kuba, passt-dev, lvivier, dgibson



On 2024-01-21 17:16, Stefano Brivio wrote:
> On Thu, 18 Jan 2024 17:22:52 -0500
> Jon Maloy <jmaloy@redhat.com> wrote:
>
>> On 2024-01-16 05:49, Paolo Abeni wrote:
>>> On Thu, 2024-01-11 at 18:00 -0500, jmaloy@redhat.com wrote:
>>>> From: Jon Maloy <jmaloy@redhat.com>
>>>>
>>>> When reading received messages from a socket with MSG_PEEK, we may want
>>>> to read the contents with an offset, like we can do with pread/preadv()
>>>> when reading files. Currently, it is not possible to do that.
>> [...]
>>>> +				err = -EINVAL;
>>>> +				goto out;
>>>> +			}
>>>> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
>>>> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
>>>> +			msg->msg_iter.nr_segs -= 1;
>>>> +			msg->msg_iter.count -= peek_offset;
>>>> +			len -= peek_offset;
>>>> +			*seq += peek_offset;
>>>> +		}
>>> IMHO this does not look like the correct interface to expose such
>>> functionality. Doing the same with a different protocol should cause a
>>> SIGSEG or the like, right?
>> I would expect doing the same thing with a different protocol to cause
>> an EFAULT, as it should. But I haven't tried it.
> So, out of curiosity, I actually tried: the current behaviour is
> recvmsg() failing with EFAULT, only as data is received (!), for TCP
> and UDP with AF_INET, and for AF_UNIX (both datagram and stream).
>
> EFAULT, however, is not in the list of "shall fail", nor "may fail"
> conditions described by POSIX.1-2008, so there isn't really anything
> that mandates it API-wise.
>
> Likewise, POSIX doesn't require any signal to be delivered (and no
> signals are delivered on Linux in any case: note that iov_base is not
> dereferenced).
>
> For TCP sockets only, passing a NULL buffer is already supported by
> recv() with MSG_TRUNC (same here, Linux extension). This change would
> finally make recvmsg() consistent with that TCP-specific bit.
>
>> This is a change to TCP only, at least until somebody decides to
>> implement it elsewhere (why not?)
> Side note, I can't really think of a reasonable use case for UDP -- it
> doesn't quite fit with the notion of message boundaries.
>
> Even letting alone the fact that passt(1) and pasta(1) don't need this
> for UDP (no acknowledgement means no need to keep unacknowledged data
> anywhere), if another application wants to do something conceptually
> similar, we should probably target recvmmsg().
>
>>> What about using/implementing SO_PEEK_OFF support instead?
>> I looked at SO_PEEK_OFF, and it honestly looks both awkward and limited.
> I think it's rather intended to skip headers with fixed size or
> suchlike.
>
>> We would have to make frequent calls to setsockopt(), something that
>> would beat much of the purpose of this feature.
> ...right, we would need to reset the SO_PEEK_OFF value at every
> recvmsg(), which is probably even worse than the current overhead.
>
>> I stand by my opinion here.
>> This feature is simple, non-intrusive, totally backwards compatible and
>> implies no changes to the API or BPI.
> My thoughts as well, plus the advantage for our user-mode networking
> case is quite remarkable given how simple the change is.

After pondering this some more, and after some team-internal discussion,
I have decided to give SO_PEEK_OFF a try, just to see the outcome, both
at kernel level and in user space.
So please hold off on applying this, if that ever happens with RFCs.

///jon
>
>> I would love to hear other opinions on this, though.
>>
>> Regards
>> /jon
>>
>>> Cheers,
>>>
>>> Paolo



* Re: [RFC net-next] tcp: add support for read with offset when using MSG_PEEK
  2024-01-18 22:22   ` Jon Maloy
@ 2024-01-21 22:16     ` Stefano Brivio
  2024-01-22 16:22       ` Jon Maloy
  0 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2024-01-21 22:16 UTC (permalink / raw)
  To: Jon Maloy, Paolo Abeni; +Cc: netdev, davem, kuba, passt-dev, lvivier, dgibson

On Thu, 18 Jan 2024 17:22:52 -0500
Jon Maloy <jmaloy@redhat.com> wrote:

> On 2024-01-16 05:49, Paolo Abeni wrote:
> > On Thu, 2024-01-11 at 18:00 -0500, jmaloy@redhat.com wrote:  
> >> From: Jon Maloy <jmaloy@redhat.com>
> >>
> >> When reading received messages from a socket with MSG_PEEK, we may want
> >> to read the contents with an offset, like we can do with pread/preadv()
> >> when reading files. Currently, it is not possible to do that.  
> [...]
> >> +				err = -EINVAL;
> >> +				goto out;
> >> +			}
> >> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
> >> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> >> +			msg->msg_iter.nr_segs -= 1;
> >> +			msg->msg_iter.count -= peek_offset;
> >> +			len -= peek_offset;
> >> +			*seq += peek_offset;
> >> +		}  
> > IMHO this does not look like the correct interface to expose such
> > functionality. Doing the same with a different protocol should cause a
> > SIGSEG or the like, right?  
>
> I would expect doing the same thing with a different protocol to cause 
> an EFAULT, as it should. But I haven't tried it.

So, out of curiosity, I actually tried: the current behaviour is
recvmsg() failing with EFAULT, only as data is received (!), for TCP
and UDP with AF_INET, and for AF_UNIX (both datagram and stream).

EFAULT, however, is not in the list of "shall fail", nor "may fail"
conditions described by POSIX.1-2008, so there isn't really anything
that mandates it API-wise.

Likewise, POSIX doesn't require any signal to be delivered (and no
signals are delivered on Linux in any case: note that iov_base is not
dereferenced).

For TCP sockets only, passing a NULL buffer is already supported by
recv() with MSG_TRUNC (same here, Linux extension). This change would
finally make recvmsg() consistent with that TCP-specific bit.
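
For reference, the existing TCP-specific behaviour mentioned above can be
sketched as follows. With MSG_TRUNC, Linux discards the received TCP bytes
instead of storing them, so the buffer pointer can be NULL. The names
tcp_discard() and tcp_loopback_pair() are hypothetical; the second one exists
only so the sketch can be tried over loopback.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Drop up to 'len' bytes from the head of a TCP receive queue without
 * copying them to user space. Returns bytes discarded, or -1 with errno set.
 */
static ssize_t tcp_discard(int fd, size_t len)
{
	return recv(fd, NULL, len, MSG_TRUNC);
}

/* Hypothetical helper for trying this out: connected TCP pair on loopback.
 * Returns 0 and fills sv[0] (client) and sv[1] (server) on success.
 */
static int tcp_loopback_pair(int sv[2])
{
	struct sockaddr_in a;
	socklen_t alen = sizeof(a);
	int l = socket(AF_INET, SOCK_STREAM, 0);

	if (l < 0)
		return -1;
	memset(&a, 0, sizeof(a));
	a.sin_family = AF_INET;
	a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);	/* port 0: kernel picks */
	if (bind(l, (struct sockaddr *)&a, sizeof(a)) || listen(l, 1) ||
	    getsockname(l, (struct sockaddr *)&a, &alen))
		goto fail;
	sv[0] = socket(AF_INET, SOCK_STREAM, 0);
	if (sv[0] < 0)
		goto fail;
	if (connect(sv[0], (struct sockaddr *)&a, sizeof(a))) {
		close(sv[0]);
		goto fail;
	}
	sv[1] = accept(l, NULL, NULL);
	close(l);
	if (sv[1] < 0) {
		close(sv[0]);
		return -1;
	}
	return 0;
fail:
	close(l);
	return -1;
}
```

After sending "hello world" on one end, tcp_discard(fd, 6) on the other end
consumes the first six bytes, and a subsequent plain recv() returns "world".
Unlike the peek case discussed in this thread, the discarded bytes are
removed from the queue.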

> This is a change to TCP only, at least until somebody decides to 
> implement it elsewhere (why not?)

Side note, I can't really think of a reasonable use case for UDP -- it
doesn't quite fit with the notion of message boundaries.

Even letting alone the fact that passt(1) and pasta(1) don't need this
for UDP (no acknowledgement means no need to keep unacknowledged data
anywhere), if another application wants to do something conceptually
similar, we should probably target recvmmsg().

> > What about using/implementing SO_PEEK_OFF support instead?
>
> I looked at SO_PEEK_OFF, and it honestly looks both awkward and limited.

I think it's rather intended to skip headers with fixed size or
suchlike.

> We would have to make frequent calls to setsockopt(), something that 
> would beat much of the purpose of this feature.

...right, we would need to reset the SO_PEEK_OFF value at every
recvmsg(), which is probably even worse than the current overhead.

> I stand by my opinion here.
> This feature is simple, non-intrusive, totally backwards compatible and 
> implies no changes to the API or BPI.

My thoughts as well, plus the advantage for our user-mode networking
case is quite remarkable given how simple the change is.

> I would love to hear other opinions on this, though.
> 
> Regards
> /jon
> 
> >
> > Cheers,
> >
> > Paolo

-- 
Stefano



* Re: [RFC net-next] tcp: add support for read with offset when using MSG_PEEK
  2024-01-16 10:49 ` Paolo Abeni
@ 2024-01-18 22:22   ` Jon Maloy
  2024-01-21 22:16     ` Stefano Brivio
  0 siblings, 1 reply; 6+ messages in thread
From: Jon Maloy @ 2024-01-18 22:22 UTC (permalink / raw)
  To: Paolo Abeni, netdev, davem; +Cc: kuba, passt-dev, sbrivio, lvivier, dgibson



On 2024-01-16 05:49, Paolo Abeni wrote:
> On Thu, 2024-01-11 at 18:00 -0500, jmaloy@redhat.com wrote:
>> From: Jon Maloy <jmaloy@redhat.com>
>>
>> When reading received messages from a socket with MSG_PEEK, we may want
>> to read the contents with an offset, like we can do with pread/preadv()
>> when reading files. Currently, it is not possible to do that.
[...]
>> +				err = -EINVAL;
>> +				goto out;
>> +			}
>> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
>> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
>> +			msg->msg_iter.nr_segs -= 1;
>> +			msg->msg_iter.count -= peek_offset;
>> +			len -= peek_offset;
>> +			*seq += peek_offset;
>> +		}
> IMHO this does not look like the correct interface to expose such
> functionality. Doing the same with a different protocol should cause a
> SIGSEG or the like, right?
I would expect doing the same thing with a different protocol to cause 
an EFAULT, as it should. But I haven't tried it.
This is a change to TCP only, at least until somebody decides to 
implement it elsewhere (why not?)
>
> What about using/implementing SO_PEEK_OFF support instead?
I looked at SO_PEEK_OFF, and it honestly looks both awkward and limited.
We would have to make frequent calls to setsockopt(), something that 
would beat much of the purpose of this feature.
I stand by my opinion here.
This feature is simple, non-intrusive, totally backwards compatible and 
implies no changes to the API or BPI.

I would love to hear other opinions on this, though.

Regards
/jon


>
> Cheers,
>
> Paolo
>



* Re: [RFC net-next] tcp: add support for read with offset when using MSG_PEEK
  2024-01-11 23:00 jmaloy
@ 2024-01-16 10:49 ` Paolo Abeni
  2024-01-18 22:22   ` Jon Maloy
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Abeni @ 2024-01-16 10:49 UTC (permalink / raw)
  To: jmaloy, netdev, davem; +Cc: kuba, passt-dev, sbrivio, lvivier, dgibson

On Thu, 2024-01-11 at 18:00 -0500, jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
> 
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
> 
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
> 
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~15 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
> 
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
> 
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
> 
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
> 
> passt and pasta are supported in KubeVirt and libvirt/qemu.
> 
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset not supported by kernel.
> 
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
> [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
> [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
> [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
> [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
> [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
> [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
> [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
> [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
> [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
> [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@freyr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@freyr:~/passt$
> 
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset supported by kernel.
> 
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 40854
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.22 GBytes  10.5 Gbits/sec
> [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> [  5]   2.00-3.00   sec  1.22 GBytes  10.5 Gbits/sec
> [  5]   3.00-4.00   sec  1.11 GBytes  9.56 Gbits/sec
> [  5]   4.00-5.00   sec  1.20 GBytes  10.3 Gbits/sec
> [  5]   5.00-6.00   sec  1.14 GBytes  9.80 Gbits/sec
> [  5]   6.00-7.00   sec  1.17 GBytes  10.0 Gbits/sec
> [  5]   7.00-8.00   sec  1.12 GBytes  9.61 Gbits/sec
> [  5]   8.00-9.00   sec  1.13 GBytes  9.74 Gbits/sec
> [  5]   9.00-10.00  sec  1.26 GBytes  10.8 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  11.8 GBytes  10.1 Gbits/sec   receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
> 
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
> 
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
>     46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
> 
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
>    27.24%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
> 
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>  net/ipv4/tcp.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1baa484d2190..82e1da3f0f98 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2351,6 +2351,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (flags & MSG_PEEK) {
>  		peek_seq = tp->copied_seq;
>  		seq = &peek_seq;
> +		if (!msg->msg_iter.__iov[0].iov_base) {
> +			size_t peek_offset;
> +
> +			if (msg->msg_iter.nr_segs < 2) {
> +				err = -EINVAL;
> +				goto out;
> +			}
> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> +			msg->msg_iter.nr_segs -= 1;
> +			msg->msg_iter.count -= peek_offset;
> +			len -= peek_offset;
> +			*seq += peek_offset;
> +		}

IMHO this does not look like the correct interface to expose such
functionality. Doing the same with a different protocol should cause a
SIGSEG or the like, right?

What about using/implementing SO_PEEK_OFF support instead? 

Cheers,

Paolo



* [RFC net-next] tcp: add support for read with offset when using MSG_PEEK
@ 2024-01-11 23:00 jmaloy
  2024-01-16 10:49 ` Paolo Abeni
  0 siblings, 1 reply; 6+ messages in thread
From: jmaloy @ 2024-01-11 23:00 UTC (permalink / raw)
  To: netdev, davem; +Cc: kuba, passt-dev, jmaloy, sbrivio, lvivier, dgibson

From: Jon Maloy <jmaloy@redhat.com>

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~15 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.

pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).

Received, pending TCP data to the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from socket, skipping data that was already sent.

At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.
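
The dummy-buffer approach mentioned above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the actual passt code:
the function name peek_skip_dummy() and the 'scratch' parameter are made up
for the example.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Peek at up to 'len' bytes starting 'offset' bytes into the receive queue,
 * using only what stock kernels offer: the leading bytes are copied into a
 * caller-supplied scratch area (at least 'offset' bytes) and then ignored.
 * Returns bytes stored in 'buf', 0 if nothing lies beyond the offset, or -1
 * with errno set.
 */
static ssize_t peek_skip_dummy(int fd, void *scratch, size_t offset,
			       void *buf, size_t len)
{
	struct iovec iov[2] = {
		{ .iov_base = scratch, .iov_len = offset },	/* discarded */
		{ .iov_base = buf, .iov_len = len },		/* wanted */
	};
	struct msghdr mh;
	ssize_t n;

	memset(&mh, 0, sizeof(mh));
	mh.msg_iov = iov;
	mh.msg_iovlen = 2;

	n = recvmsg(fd, &mh, MSG_PEEK | MSG_DONTWAIT);
	if (n < 0)
		return -1;

	return (size_t)n > offset ? n - (ssize_t)offset : 0;
}
```

Every call copies 'offset' bytes into the scratch area only to throw them
away, which is the copy_to_user() work the change aims to remove.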

passt and pasta are supported in KubeVirt and libvirt/qemu.

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
MSG_PEEK with offset not supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
MSG_PEEK with offset supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 40854
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.22 GBytes  10.5 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.22 GBytes  10.5 Gbits/sec
[  5]   3.00-4.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   4.00-5.00   sec  1.20 GBytes  10.3 Gbits/sec
[  5]   5.00-6.00   sec  1.14 GBytes  9.80 Gbits/sec
[  5]   6.00-7.00   sec  1.17 GBytes  10.0 Gbits/sec
[  5]   7.00-8.00   sec  1.12 GBytes  9.61 Gbits/sec
[  5]   8.00-9.00   sec  1.13 GBytes  9.74 Gbits/sec
[  5]   9.00-10.00  sec  1.26 GBytes  10.8 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  11.8 GBytes  10.1 Gbits/sec   receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
    46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
   27.24%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
 net/ipv4/tcp.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1baa484d2190..82e1da3f0f98 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2351,6 +2351,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			size_t peek_offset;
+
+			if (msg->msg_iter.nr_segs < 2) {
+				err = -EINVAL;
+				goto out;
+			}
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			msg->msg_iter.nr_segs -= 1;
+			msg->msg_iter.count -= peek_offset;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-- 
2.42.0



Thread overview: 6+ messages
2024-01-11 22:22 [RFC net-next] tcp: add support for read with offset when using MSG_PEEK jmaloy
2024-01-11 23:00 jmaloy
2024-01-16 10:49 ` Paolo Abeni
2024-01-18 22:22   ` Jon Maloy
2024-01-21 22:16     ` Stefano Brivio
2024-01-22 16:22       ` Jon Maloy

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt
