* [RFC net-next] tcp: add support for SO_PEEK_OFF
@ 2024-02-01 21:32 jmaloy
2024-02-01 21:47 ` Stefano Brivio
2024-02-07 4:39 ` David Gibson
0 siblings, 2 replies; 3+ messages in thread
From: jmaloy @ 2024-02-01 21:32 UTC (permalink / raw)
To: netdev, davem; +Cc: kuba, passt-dev, jmaloy, sbrivio, lvivier, dgibson
From: Jon Maloy <jmaloy@redhat.com>
When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.
In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
in a similar way it is done for Unix Domain sockets.
In the iperf3 log examples shown below, we can observe a throughput
improvement of 15-20 % in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top).
This is a consistent result.
pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).
Received, pending TCP data to the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from socket, skipping data that was already sent.
At the moment this is implemented using a dummy buffer passed to
recvmsg(). With this change, we don't need a dummy buffer and the
related buffer copy (copy_to_user()) anymore.
passt and pasta are supported in KubeVirt and libvirt/qemu.
jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF not supported by kernel.
jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
[ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
[ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
[ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
[ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
[ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
[ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
[ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
[ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
[ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
[ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$
jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF supported by kernel.
jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 52084
[ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec
[ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
[ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec
[ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec
[ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec
[ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec
[ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec
[ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec
[ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec
[ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec
[ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$
The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.
Without offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
With offset support:
----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
include/net/tcp.h | 1 +
net/ipv4/af_inet.c | 1 +
net/ipv4/tcp.c | 25 +++++++++++++++++++------
3 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 87f0e6c2e1f2..7eca7f2ac102 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -357,6 +357,7 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family);
ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags);
+int tcp_set_peek_offset(struct sock *sk, int val);
struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
bool force_schedule);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index fb81de10d332..7a8b3a91257f 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1068,6 +1068,7 @@ const struct proto_ops inet_stream_ops = {
#endif
.splice_eof = inet_splice_eof,
.splice_read = tcp_splice_read,
+ .set_peek_off = tcp_set_peek_offset,
.read_sock = tcp_read_sock,
.read_skb = tcp_read_skb,
.sendmsg_locked = tcp_sendmsg_locked,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fce5668a6a3d..33ade88633de 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -863,6 +863,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
}
EXPORT_SYMBOL(tcp_splice_read);
+int tcp_set_peek_offset(struct sock *sk, int val)
+{
+ WRITE_ONCE(sk->sk_peek_off, val);
+
+ return 0;
+}
+EXPORT_SYMBOL(tcp_set_peek_offset);
+
struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
bool force_schedule)
{
@@ -2302,7 +2310,6 @@ static int tcp_inq_hint(struct sock *sk)
* tricks with *seq access order and skb->users are not required.
* Probably, code can be easily improved even more.
*/
-
static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
int flags, struct scm_timestamping_internal *tss,
int *cmsg_flags)
@@ -2317,6 +2324,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
+ u32 peek_offset = 0;
err = -ENOTCONN;
if (sk->sk_state == TCP_LISTEN)
@@ -2349,7 +2357,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
seq = &tp->copied_seq;
if (flags & MSG_PEEK) {
- peek_seq = tp->copied_seq;
+ peek_offset = max(sk_peek_offset(sk, flags), 0);
+ peek_seq = tp->copied_seq + peek_offset;
seq = &peek_seq;
}
@@ -2452,11 +2461,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
}
if ((flags & MSG_PEEK) &&
- (peek_seq - copied - urg_hole != tp->copied_seq)) {
+ (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
current->comm,
task_pid_nr(current));
- peek_seq = tp->copied_seq;
+ peek_seq = tp->copied_seq + peek_offset;
}
continue;
@@ -2497,7 +2506,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
WRITE_ONCE(*seq, *seq + used);
copied += used;
len -= used;
-
+ if (flags & MSG_PEEK)
+ sk_peek_offset_fwd(sk, used);
+ else
+ sk_peek_offset_bwd(sk, used);
tcp_rcv_space_adjust(sk);
skip_copy:
@@ -2774,6 +2786,7 @@ void __tcp_close(struct sock *sk, long timeout)
data_was_unread += len;
__kfree_skb(skb);
}
+ sk_set_peek_off(sk, -1);
/* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
if (sk->sk_state == TCP_CLOSE)
@@ -4492,7 +4505,7 @@ void tcp_done(struct sock *sk)
reqsk_fastopen_remove(sk, req, false);
WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
-
+ sk_set_peek_off(sk, -1);
if (!sock_flag(sk, SOCK_DEAD))
sk->sk_state_change(sk);
else
--
@@ -863,6 +863,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
}
EXPORT_SYMBOL(tcp_splice_read);
+int tcp_set_peek_offset(struct sock *sk, int val)
+{
+ WRITE_ONCE(sk->sk_peek_off, val);
+
+ return 0;
+}
+EXPORT_SYMBOL(tcp_set_peek_offset);
+
struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
bool force_schedule)
{
@@ -2302,7 +2310,6 @@ static int tcp_inq_hint(struct sock *sk)
* tricks with *seq access order and skb->users are not required.
* Probably, code can be easily improved even more.
*/
-
static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
int flags, struct scm_timestamping_internal *tss,
int *cmsg_flags)
@@ -2317,6 +2324,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
+ u32 peek_offset = 0;
err = -ENOTCONN;
if (sk->sk_state == TCP_LISTEN)
@@ -2349,7 +2357,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
seq = &tp->copied_seq;
if (flags & MSG_PEEK) {
- peek_seq = tp->copied_seq;
+ peek_offset = max(sk_peek_offset(sk, flags), 0);
+ peek_seq = tp->copied_seq + peek_offset;
seq = &peek_seq;
}
@@ -2452,11 +2461,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
}
if ((flags & MSG_PEEK) &&
- (peek_seq - copied - urg_hole != tp->copied_seq)) {
+ (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
current->comm,
task_pid_nr(current));
- peek_seq = tp->copied_seq;
+ peek_seq = tp->copied_seq + peek_offset;
}
continue;
@@ -2497,7 +2506,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
WRITE_ONCE(*seq, *seq + used);
copied += used;
len -= used;
-
+ if (flags & MSG_PEEK)
+ sk_peek_offset_fwd(sk, used);
+ else
+ sk_peek_offset_bwd(sk, used);
tcp_rcv_space_adjust(sk);
skip_copy:
@@ -2774,6 +2786,7 @@ void __tcp_close(struct sock *sk, long timeout)
data_was_unread += len;
__kfree_skb(skb);
}
+ sk_set_peek_off(sk, -1);
/* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
if (sk->sk_state == TCP_CLOSE)
@@ -4492,7 +4505,7 @@ void tcp_done(struct sock *sk)
reqsk_fastopen_remove(sk, req, false);
WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
-
+ sk_set_peek_off(sk, -1);
if (!sock_flag(sk, SOCK_DEAD))
sk->sk_state_change(sk);
else
--
2.42.0
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [RFC net-next] tcp: add support for SO_PEEK_OFF
2024-02-01 21:32 [RFC net-next] tcp: add support for SO_PEEK_OFF jmaloy
@ 2024-02-01 21:47 ` Stefano Brivio
2024-02-07 4:39 ` David Gibson
1 sibling, 0 replies; 3+ messages in thread
From: Stefano Brivio @ 2024-02-01 21:47 UTC (permalink / raw)
To: jmaloy; +Cc: netdev, davem, kuba, passt-dev, lvivier, dgibson, Paolo Abeni
On Thu, 1 Feb 2024 16:32:01 -0500
jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF not supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
> [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
> [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
> [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
> [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
> [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
> [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
> [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
> [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
> [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@freyr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@freyr:~/passt$
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec
> [ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec
> [ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec
> [ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec
> [ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec
> [ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec
> [ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec
> [ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec
> [ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
I guess this was Suggested-by: Paolo Abeni <pabeni@redhat.com> ? :)
> ---
> include/net/tcp.h | 1 +
> net/ipv4/af_inet.c | 1 +
> net/ipv4/tcp.c | 25 +++++++++++++++++++------
> 3 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 87f0e6c2e1f2..7eca7f2ac102 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -357,6 +357,7 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family);
> ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
> struct pipe_inode_info *pipe, size_t len,
> unsigned int flags);
> +int tcp_set_peek_offset(struct sock *sk, int val);
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule);
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index fb81de10d332..7a8b3a91257f 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1068,6 +1068,7 @@ const struct proto_ops inet_stream_ops = {
> #endif
> .splice_eof = inet_splice_eof,
> .splice_read = tcp_splice_read,
> + .set_peek_off = tcp_set_peek_offset,
> .read_sock = tcp_read_sock,
> .read_skb = tcp_read_skb,
> .sendmsg_locked = tcp_sendmsg_locked,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index fce5668a6a3d..33ade88633de 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -863,6 +863,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
> }
> EXPORT_SYMBOL(tcp_splice_read);
>
> +int tcp_set_peek_offset(struct sock *sk, int val)
> +{
> + WRITE_ONCE(sk->sk_peek_off, val);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(tcp_set_peek_offset);
> +
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule)
> {
> @@ -2302,7 +2310,6 @@ static int tcp_inq_hint(struct sock *sk)
> * tricks with *seq access order and skb->users are not required.
> * Probably, code can be easily improved even more.
> */
> -
Stray change.
> static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> int flags, struct scm_timestamping_internal *tss,
> int *cmsg_flags)
> @@ -2317,6 +2324,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> long timeo;
> struct sk_buff *skb, *last;
> u32 urg_hole = 0;
> + u32 peek_offset = 0;
>
> err = -ENOTCONN;
> if (sk->sk_state == TCP_LISTEN)
> @@ -2349,7 +2357,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>
> seq = &tp->copied_seq;
> if (flags & MSG_PEEK) {
> - peek_seq = tp->copied_seq;
> + peek_offset = max(sk_peek_offset(sk, flags), 0);
> + peek_seq = tp->copied_seq + peek_offset;
> seq = &peek_seq;
> }
>
And with this, explicit support in tcp_peek_sndq() is not actually
needed, but this comment in that function:
/* XXX -- need to support SO_PEEK_OFF */
should be removed now I guess.
> @@ -2452,11 +2461,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> }
>
> if ((flags & MSG_PEEK) &&
> - (peek_seq - copied - urg_hole != tp->copied_seq)) {
> + (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
> net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
> current->comm,
> task_pid_nr(current));
> - peek_seq = tp->copied_seq;
> + peek_seq = tp->copied_seq + peek_offset;
> }
> continue;
>
> @@ -2497,7 +2506,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> WRITE_ONCE(*seq, *seq + used);
> copied += used;
> len -= used;
> -
> + if (flags & MSG_PEEK)
> + sk_peek_offset_fwd(sk, used);
> + else
> + sk_peek_offset_bwd(sk, used);
> tcp_rcv_space_adjust(sk);
>
> skip_copy:
> @@ -2774,6 +2786,7 @@ void __tcp_close(struct sock *sk, long timeout)
> data_was_unread += len;
> __kfree_skb(skb);
> }
> + sk_set_peek_off(sk, -1);
>
> /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
> if (sk->sk_state == TCP_CLOSE)
> @@ -4492,7 +4505,7 @@ void tcp_done(struct sock *sk)
> reqsk_fastopen_remove(sk, req, false);
>
> WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
> -
> + sk_set_peek_off(sk, -1);
> if (!sock_flag(sk, SOCK_DEAD))
> sk->sk_state_change(sk);
> else
--
Stefano
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [RFC net-next] tcp: add support for SO_PEEK_OFF
2024-02-01 21:32 [RFC net-next] tcp: add support for SO_PEEK_OFF jmaloy
2024-02-01 21:47 ` Stefano Brivio
@ 2024-02-07 4:39 ` David Gibson
1 sibling, 0 replies; 3+ messages in thread
From: David Gibson @ 2024-02-07 4:39 UTC (permalink / raw)
To: jmaloy; +Cc: netdev, davem, kuba, passt-dev, sbrivio, lvivier, dgibson
[-- Attachment #1: Type: text/plain, Size: 10031 bytes --]
On Thu, Feb 01, 2024 at 04:32:01PM -0500, jmaloy@redhat.com wrote:
> From: Jon Maloy <jmaloy@redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF not supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec
> [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec
> [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec
> [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec
> [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec
> [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec
> [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec
> [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec
> [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec
> [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec
> [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@freyr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@freyr:~/passt$
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec
> [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec
> [ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec
> [ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec
> [ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec
> [ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec
> [ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec
> [ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec
> [ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec
> [ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec
> [ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval Transfer Bitrate
> [ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Apart from one trivial nit noted below,
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> include/net/tcp.h | 1 +
> net/ipv4/af_inet.c | 1 +
> net/ipv4/tcp.c | 25 +++++++++++++++++++------
> 3 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 87f0e6c2e1f2..7eca7f2ac102 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -357,6 +357,7 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family);
> ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
> struct pipe_inode_info *pipe, size_t len,
> unsigned int flags);
> +int tcp_set_peek_offset(struct sock *sk, int val);
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule);
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index fb81de10d332..7a8b3a91257f 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1068,6 +1068,7 @@ const struct proto_ops inet_stream_ops = {
> #endif
> .splice_eof = inet_splice_eof,
> .splice_read = tcp_splice_read,
> + .set_peek_off = tcp_set_peek_offset,
> .read_sock = tcp_read_sock,
> .read_skb = tcp_read_skb,
> .sendmsg_locked = tcp_sendmsg_locked,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index fce5668a6a3d..33ade88633de 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -863,6 +863,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
> }
> EXPORT_SYMBOL(tcp_splice_read);
>
> +int tcp_set_peek_offset(struct sock *sk, int val)
> +{
> + WRITE_ONCE(sk->sk_peek_off, val);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(tcp_set_peek_offset);
> +
> struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
> bool force_schedule)
> {
> @@ -2302,7 +2310,6 @@ static int tcp_inq_hint(struct sock *sk)
> * tricks with *seq access order and skb->users are not required.
> * Probably, code can be easily improved even more.
> */
> -
Extraneous whitespace change here.
> static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> int flags, struct scm_timestamping_internal *tss,
> int *cmsg_flags)
> @@ -2317,6 +2324,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> long timeo;
> struct sk_buff *skb, *last;
> u32 urg_hole = 0;
> + u32 peek_offset = 0;
>
> err = -ENOTCONN;
> if (sk->sk_state == TCP_LISTEN)
> @@ -2349,7 +2357,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>
> seq = &tp->copied_seq;
> if (flags & MSG_PEEK) {
> - peek_seq = tp->copied_seq;
> + peek_offset = max(sk_peek_offset(sk, flags), 0);
> + peek_seq = tp->copied_seq + peek_offset;
> seq = &peek_seq;
> }
>
> @@ -2452,11 +2461,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> }
>
> if ((flags & MSG_PEEK) &&
> - (peek_seq - copied - urg_hole != tp->copied_seq)) {
> + (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
> net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
> current->comm,
> task_pid_nr(current));
> - peek_seq = tp->copied_seq;
> + peek_seq = tp->copied_seq + peek_offset;
> }
> continue;
>
> @@ -2497,7 +2506,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> WRITE_ONCE(*seq, *seq + used);
> copied += used;
> len -= used;
> -
> + if (flags & MSG_PEEK)
> + sk_peek_offset_fwd(sk, used);
> + else
> + sk_peek_offset_bwd(sk, used);
> tcp_rcv_space_adjust(sk);
>
> skip_copy:
> @@ -2774,6 +2786,7 @@ void __tcp_close(struct sock *sk, long timeout)
> data_was_unread += len;
> __kfree_skb(skb);
> }
> + sk_set_peek_off(sk, -1);
>
> /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
> if (sk->sk_state == TCP_CLOSE)
> @@ -4492,7 +4505,7 @@ void tcp_done(struct sock *sk)
> reqsk_fastopen_remove(sk, req, false);
>
> WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
> -
> + sk_set_peek_off(sk, -1);
> if (!sock_flag(sk, SOCK_DEAD))
> sk->sk_state_change(sk);
> else
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2024-02-07 4:59 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-01 21:32 [RFC net-next] tcp: add support for SO_PEEK_OFF jmaloy
2024-02-01 21:47 ` Stefano Brivio
2024-02-07 4:39 ` David Gibson
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).