From: Jon Maloy
To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com, jmaloy@redhat.com
Subject: [RFC net-next v3] tcp: add support for read with offset when using MSG_PEEK
Date: Mon, 8 Jan 2024 04:16:20 -0500
Message-ID: <20240108091620.786698-1-jmaloy@redhat.com>
MIME-Version: 1.0
List-Id: Development discussion and patches for passt

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~15 % (8.78 to 10.1 Gbits/sec) in the direction
host->namespace when using the protocol splicer 'pasta'
(https://passt.top). This is a consistent result.

'passt' is a tool used to support VMs in containers, such as KubeVirt,
and is also generally supported in libvirt/QEMU since release 9.2 / 7.2.
'pasta' is the pure namespace variety of the same tool.

jmaloy@lubu:~/passt/passt$ ../passt//net/tools/perf/perf record -g ./pasta --config-net -f
MSG_PEEK with offset not supported by kernel.
root@lubu:~/passt/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
root@lubu:~/passt/passt# logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@lubu:~/passt/passt$

jmaloy@lubu:~/passt/passt$ /home/jmaloy/passt//net/tools/perf/perf record -g ./pasta --config-net -f
MSG_PEEK with offset supported by kernel.
root@lubu:~/passt/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 40854
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.22 GBytes  10.5 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.22 GBytes  10.5 Gbits/sec
[  5]   3.00-4.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   4.00-5.00   sec  1.20 GBytes  10.3 Gbits/sec
[  5]   5.00-6.00   sec  1.14 GBytes  9.80 Gbits/sec
[  5]   6.00-7.00   sec  1.17 GBytes  10.0 Gbits/sec
[  5]   7.00-8.00   sec  1.12 GBytes  9.61 Gbits/sec
[  5]   8.00-9.00   sec  1.13 GBytes  9.74 Gbits/sec
[  5]   9.00-10.00  sec  1.26 GBytes  10.8 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  11.8 GBytes  10.1 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@lubu:~/passt/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
----------------------
jmaloy@lubu:~/passt/net/tools/perf$ ./perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i /home/jmaloy/passt/passt/perf.data | head -1
    46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
jmaloy@lubu:~$

With offset support:
----------------------
jmaloy@lubu:~/passt/net/tools/perf$ ./perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i /home/jmaloy/passt/passt/perf.data | head -1
    27.24%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
jmaloy@lubu:~$

Signed-off-by: Jon Maloy
---
v3: Made changes suggested by Stefano Brivio:
    - Added perf result to commit log
    - Separated parameter sanity tests from code logic

Signed-off-by: Jon Maloy
---
 net/ipv4/tcp.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ff6838ca2e58..007b56dfc9e0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	size_t peek_offset;
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
@@ -2353,6 +2354,18 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			if (peek_offset >= len || msg->msg_iter.nr_segs <= 1) {
+				err = -EINVAL;
+				goto out;
+			}
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			msg->msg_iter.nr_segs -= 1;
+			msg->msg_iter.count -= peek_offset;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-- 
2.42.0
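
For illustration only, here is a minimal userspace sketch of how a caller
might use the interface described in the commit message. It is not part
of the patch. The helper name peek_at_offset() is hypothetical, the use
of MSG_DONTWAIT is a choice made for the sketch, and the return value
semantics (bytes copied after the offset) are an assumption based on my
reading of the diff. The sketch also assumes that a kernel without this
change fails the NULL-base call (typically with EFAULT), in which case it
retries with a throwaway buffer for the skipped bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper, not part of the patch: peek up to len bytes that
 * start offset bytes into the receive queue of fd, without consuming them.
 * Returns the number of bytes stored in buf (0 if no data beyond the
 * offset is queued), or -1 with errno set.
 */
static ssize_t peek_at_offset(int fd, void *buf, size_t len, size_t offset)
{
	struct iovec iov[2];
	struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };
	void *scratch;
	ssize_t n;
	int saved;

	assert(buf != NULL && len > 0);

	/* Form proposed by the RFC: NULL base, iov_len carries the offset */
	iov[0].iov_base = NULL;
	iov[0].iov_len = offset;
	iov[1].iov_base = buf;
	iov[1].iov_len = len;

	/* MSG_DONTWAIT is used here only so that the sketch never blocks */
	n = recvmsg(fd, &msg, MSG_PEEK | MSG_DONTWAIT);
	if (n >= 0 || offset == 0)
		return n;

	/*
	 * Assumption: the call above failed because the kernel does not
	 * implement the NULL-base form. Retry with a scratch buffer that
	 * receives (and then discards) the first offset bytes.
	 */
	scratch = malloc(offset);
	if (!scratch) {
		errno = ENOMEM;
		return -1;
	}
	iov[0].iov_base = scratch;
	n = recvmsg(fd, &msg, MSG_PEEK | MSG_DONTWAIT);
	saved = errno;		/* free() may clobber errno */
	free(scratch);
	errno = saved;
	if (n < 0)
		return -1;

	/* Here n counts the skipped bytes too, so subtract them */
	return (size_t)n > offset ? n - (ssize_t)offset : 0;
}
```

With ten bytes "0123456789" queued, peek_at_offset(fd, buf, 4, 3) is
expected to store "3456" in buf and leave the queue untouched. The
fallback path copies the skipped bytes into memory only to drop them,
which is the overhead the proposed NULL-base form is meant to avoid.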