From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by passt.top (Postfix) with ESMTP id 4110E5A026D
	for ; Wed, 6 Dec 2023 00:02:03 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1701817322;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding;
	bh=UDtcm2npTcE/nQhKktfWYjUShlbJqSmRqITVKaysWtg=;
	b=J0E4gAnoWCD2Ud41Yc8VzQi7d7PjTDr4AHg465RokeIFfQRDO/GDHtuMVes4QNeo4WxvBB
	 eYfJmEv15lOKmBSRCRRbkv7aa36vRn30jm+AM1cVhmK2Xj+CC020NRYZKnBrhDz8m44vb6
	 s4Z6TLXfj2kKHCCzyAgkjOpOj9Y0JV8=
Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88])
	by relay.mimecast.com with ESMTP with STARTTLS
	(version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384)
	id us-mta-177-en9CPxO2PYadYGq8xBJMAw-1; Tue, 05 Dec 2023 18:02:00 -0500
X-MC-Unique: en9CPxO2PYadYGq8xBJMAw-1
Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.rdu2.redhat.com [10.11.54.1])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
	(No client certificate requested)
	by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 5AD57101A555
	for ; Tue, 5 Dec 2023 23:02:00 +0000 (UTC)
Received: from fenrir.redhat.com (unknown [10.2.20.42])
	by smtp.corp.redhat.com (Postfix) with ESMTP id 181BE3C25;
	Tue, 5 Dec 2023 23:02:00 +0000 (UTC)
From: Jon Maloy
To: sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com,
	jmaloy@redhat.com, passt-dev@passt.top
Subject: tcp: add support for read with offset when using MSG_PEEK
Date: Tue, 5 Dec 2023 18:01:52 -0500
Message-Id: <20231205230152.1490012-1-jmaloy@redhat.com>
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.1
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="US-ASCII"; x-default=true
Message-ID-Hash: RGTCT5WCLO6L6GVYLKTVEJ2EMPTHFKAU
X-Message-ID-Hash: RGTCT5WCLO6L6GVYLKTVEJ2EMPTHFKAU
X-MailFrom: jmaloy@redhat.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop;
	banned-address; member-moderation; nonmember-moderation; administrivia;
	implicit-dest; max-recipients; max-size; news-moderation; no-subject;
	digests; suspicious-header
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt
Archived-At:
Archived-At:
List-Archive:
List-Archive:
List-Help:
List-Owner:
List-Post:
List-Subscribe:
List-Unsubscribe:

When reading received messages with MSG_PEEK, we sometimes have to read
the leading bytes of the stream several times, only to reach the bytes
we really want. This is clearly suboptimal.

What we would want is something similar to pread/preadv(), but working
even for TCP sockets. At the same time, we don't want to add any new
arguments to the recv/recvmsg() calls.

In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence letting the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments or flags.

In the iperf3 log examples shown below, we can observe a throughput
improvement of ~20 % in the direction host->namespace when using the
protocol splicer 'passt'. This is a consistent result.

$ ./passt/passt/pasta --config-net -f

MSG_PEEK with offset not supported.

[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 60344
[  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 60360
[ ID] Interval           Transfer     Bitrate
[...]
[  6]  13.00-14.00  sec  2.54 GBytes  21.8 Gbits/sec
[  6]  14.00-15.00  sec  2.52 GBytes  21.7 Gbits/sec
[  6]  15.00-16.00  sec  2.50 GBytes  21.5 Gbits/sec
[  6]  16.00-17.00  sec  2.49 GBytes  21.4 Gbits/sec
[  6]  17.00-18.00  sec  2.51 GBytes  21.6 Gbits/sec
[  6]  18.00-19.00  sec  2.48 GBytes  21.3 Gbits/sec
[  6]  19.00-20.00  sec  2.49 GBytes  21.4 Gbits/sec
[  6]  20.00-20.04  sec  87.4 MBytes  19.2 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  6]   0.00-20.04  sec  48.9 GBytes  21.0 Gbits/sec                  receiver
-----------------------------------------------------------

[jmaloy@fedora37 ~]$ ./passt/passt/pasta --config-net -f

MSG_PEEK with offset supported.

[root@fedora37 ~]# perf record iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 46362
[  6] local 192.168.122.163 port 5201 connected to 192.168.122.1 port 46374
[ ID] Interval           Transfer     Bitrate
[...]
[  6]  12.00-13.00  sec  3.18 GBytes  27.3 Gbits/sec
[  6]  13.00-14.00  sec  3.17 GBytes  27.3 Gbits/sec
[  6]  14.00-15.00  sec  3.13 GBytes  26.9 Gbits/sec
[  6]  15.00-16.00  sec  3.17 GBytes  27.3 Gbits/sec
[  6]  16.00-17.00  sec  3.17 GBytes  27.2 Gbits/sec
[  6]  17.00-18.00  sec  3.14 GBytes  27.0 Gbits/sec
[  6]  18.00-19.00  sec  3.17 GBytes  27.2 Gbits/sec
[  6]  19.00-20.00  sec  3.12 GBytes  26.8 Gbits/sec
[  6]  20.00-20.04  sec   119 MBytes  25.5 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  6]   0.00-20.04  sec  59.4 GBytes  25.4 Gbits/sec                  receiver
-----------------------------------------------------------

Passt is used to support VMs in containers, such as KubeVirt, and
is also generally supported in libvirt/QEMU since release 9.2 / 7.2.
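
For illustration, the iov_base == NULL convention described above can be
sketched from the caller's side roughly as follows. This is only a sketch
under stated assumptions, not code from passt or from this patch: the helper
names peek_at_offset_today() and peek_offset_iov() are hypothetical. The
first helper shows what an unpatched kernel requires (a scratch buffer that
receives the leading bytes only so they can be thrown away); the second only
fills in the iovec layout this commit proposes and would be meaningful only
on a kernel carrying the patch.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: peek `len` bytes located `offset` bytes into the
 * receive queue on an unpatched kernel. The first `offset` bytes are copied
 * into `scratch` and then ignored, which is the overhead the commit message
 * describes. Returns the number of bytes placed in `buf`, or -1 on error.
 */
ssize_t peek_at_offset_today(int fd, void *scratch, size_t offset,
			     void *buf, size_t len)
{
	struct iovec iov[2] = {
		{ .iov_base = scratch, .iov_len = offset }, /* discarded */
		{ .iov_base = buf,     .iov_len = len },    /* wanted */
	};
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };
	ssize_t n = recvmsg(fd, &mh, MSG_PEEK);

	if (n < 0)
		return -1;

	/* Only the bytes beyond the offset landed in `buf`. */
	return (size_t)n > offset ? n - (ssize_t)offset : 0;
}

/*
 * Hypothetical helper: build the iovec layout proposed by this commit.
 * Entry 0 has a NULL base, so its iov_len is read as "skip this many
 * bytes"; entry 1 is the real destination. Assumption: this is only
 * honoured by a kernel with the patch applied, and only with MSG_PEEK.
 */
void peek_offset_iov(struct iovec iov[2], size_t offset,
		     void *buf, size_t len)
{
	iov[0].iov_base = NULL;
	iov[0].iov_len = offset;
	iov[1].iov_base = buf;
	iov[1].iov_len = len;
}
```

With the proposed layout, a caller would pass the two-entry iovec to
recvmsg() with MSG_PEEK exactly as in the first helper, but without the
scratch buffer and without the kernel copying the skipped bytes.
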

Signed-off-by: Jon Maloy
Signed-off-by: Jon Paul Maloy
---
 net/ipv4/tcp.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 53bcc17c91e4..e9d3b5bf2f66 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2310,6 +2310,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	size_t peek_offset;
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
@@ -2353,6 +2354,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
 		seq = &peek_seq;
+		if (!msg->msg_iter.__iov[0].iov_base) {
+			peek_offset = msg->msg_iter.__iov[0].iov_len;
+			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
+			if (msg->msg_iter.nr_segs <= 1)
+				goto out;
+			msg->msg_iter.nr_segs -= 1;
+			if (msg->msg_iter.count <= peek_offset)
+				goto out;
+			msg->msg_iter.count -= peek_offset;
+			if (len <= peek_offset)
+				goto out;
+			len -= peek_offset;
+			*seq += peek_offset;
+		}
 	}
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
-- 
2.39.2