public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Jon Maloy <jmaloy@redhat.com>
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
	dgibson@redhat.com
Cc: Paolo Abeni <pabeni@redhat.com>
Subject: Re: [PATCH v2] tcp: add support for SO_PEEK_OFF
Date: Thu, 8 Feb 2024 09:52:27 -0500	[thread overview]
Message-ID: <187f89bd-a49a-cb40-0444-61b19ef1091e@redhat.com> (raw)
In-Reply-To: <20240208145017.2938015-1-jmaloy@redhat.com>

Stefano,
As soon as I get your "Reviewed-by" on this I will send it to netdev.
///jon

On 2024-02-08 09:50, Jon Maloy wrote:
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF not supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
> [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
> [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
> [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
> [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
> [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
> [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
> [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
> [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
> [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
> [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@freyr:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@freyr:~/passt$
>
> jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> SO_PEEK_OFF supported by kernel.
>
> jmaloy@freyr:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
> [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> [  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
> [  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
> [  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
> [  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
> [  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
> [  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
> [  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
> [  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
> [  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec                  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
>      46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
>     28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
>
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Reviewed-by: David Gibson <dgibson@redhat.com>
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
>   include/net/tcp.h  |  1 +
>   net/ipv4/af_inet.c |  1 +
>   net/ipv4/tcp.c     | 26 +++++++++++++++++++-------
>   3 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 87f0e6c2e1f2..7eca7f2ac102 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -357,6 +357,7 @@ void tcp_twsk_purge(struct list_head *net_exit_list, int family);
>   ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
>   			struct pipe_inode_info *pipe, size_t len,
>   			unsigned int flags);
> +int tcp_set_peek_offset(struct sock *sk, int val);
>   struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
>   				     bool force_schedule);
>   
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index fb81de10d332..7a8b3a91257f 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1068,6 +1068,7 @@ const struct proto_ops inet_stream_ops = {
>   #endif
>   	.splice_eof	   = inet_splice_eof,
>   	.splice_read	   = tcp_splice_read,
> +	.set_peek_off      = tcp_set_peek_offset,
>   	.read_sock	   = tcp_read_sock,
>   	.read_skb	   = tcp_read_skb,
>   	.sendmsg_locked    = tcp_sendmsg_locked,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index fce5668a6a3d..ad03e0cee3c1 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -863,6 +863,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
>   }
>   EXPORT_SYMBOL(tcp_splice_read);
>   
> +int tcp_set_peek_offset(struct sock *sk, int val)
> +{
> +	WRITE_ONCE(sk->sk_peek_off, val);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(tcp_set_peek_offset);
> +
>   struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
>   				     bool force_schedule)
>   {
> @@ -1414,8 +1422,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
>   	struct sk_buff *skb;
>   	int copied = 0, err = 0;
>   
> -	/* XXX -- need to support SO_PEEK_OFF */
> -
>   	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
>   		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
>   		if (err)
> @@ -2317,6 +2323,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>   	long timeo;
>   	struct sk_buff *skb, *last;
>   	u32 urg_hole = 0;
> +	u32 peek_offset = 0;
>   
>   	err = -ENOTCONN;
>   	if (sk->sk_state == TCP_LISTEN)
> @@ -2349,7 +2356,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>   
>   	seq = &tp->copied_seq;
>   	if (flags & MSG_PEEK) {
> -		peek_seq = tp->copied_seq;
> +		peek_offset = max(sk_peek_offset(sk, flags), 0);
> +		peek_seq = tp->copied_seq + peek_offset;
>   		seq = &peek_seq;
>   	}
>   
> @@ -2452,11 +2460,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>   		}
>   
>   		if ((flags & MSG_PEEK) &&
> -		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
> +		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
>   			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
>   					    current->comm,
>   					    task_pid_nr(current));
> -			peek_seq = tp->copied_seq;
> +			peek_seq = tp->copied_seq + peek_offset;
>   		}
>   		continue;
>   
> @@ -2497,7 +2505,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>   		WRITE_ONCE(*seq, *seq + used);
>   		copied += used;
>   		len -= used;
> -
> +		if (flags & MSG_PEEK)
> +			sk_peek_offset_fwd(sk, used);
> +		else
> +			sk_peek_offset_bwd(sk, used);
>   		tcp_rcv_space_adjust(sk);
>   
>   skip_copy:
> @@ -2774,6 +2785,7 @@ void __tcp_close(struct sock *sk, long timeout)
>   		data_was_unread += len;
>   		__kfree_skb(skb);
>   	}
> +	sk_set_peek_off(sk, -1);
>   
>   	/* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
>   	if (sk->sk_state == TCP_CLOSE)
> @@ -4492,7 +4504,7 @@ void tcp_done(struct sock *sk)
>   		reqsk_fastopen_remove(sk, req, false);
>   
>   	WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
> -
> +	sk_set_peek_off(sk, -1);
>   	if (!sock_flag(sk, SOCK_DEAD))
>   		sk->sk_state_change(sk);
>   	else


  reply	other threads:[~2024-02-08 14:52 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-08 14:50 [PATCH v2] tcp: add support for SO_PEEK_OFF Jon Maloy
2024-02-08 14:52 ` Jon Maloy [this message]
2024-02-08 15:48 ` Stefano Brivio
2024-02-08 16:05 ` Paolo Abeni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=187f89bd-a49a-cb40-0444-61b19ef1091e@redhat.com \
    --to=jmaloy@redhat.com \
    --cc=dgibson@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=pabeni@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).