From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 01DA35A026F for ; Thu, 11 Jan 2024 14:28:08 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1704979687; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=SVL8EyJsjj5Y12SN4MgWOUi+iJPMDEen9la71DUdf4k=; b=YxNzycwEizIqz/vvGdrc1yxOsgdt0U6ICq1BIEtHOOC6TlToTLAXsAUJzsOc9ABkmQP0Ht sAUSnbfT+Sxemw3zMBuj7BpQ2Yeht8s3PQ8pA0xksKiYKODw1rHzhWcKNzgEQYC+KTGsTg WF1cDng8Kpk6wjbvznTrw2zQznlFPGc= Received: from mail-ot1-f71.google.com (mail-ot1-f71.google.com [209.85.210.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-639-FHe2petoMbSdhBPWNjYfng-1; Thu, 11 Jan 2024 08:28:06 -0500 X-MC-Unique: FHe2petoMbSdhBPWNjYfng-1 Received: by mail-ot1-f71.google.com with SMTP id 46e09a7af769-6de423c4a27so1744066a34.0 for ; Thu, 11 Jan 2024 05:28:06 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1704979686; x=1705584486; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=SVL8EyJsjj5Y12SN4MgWOUi+iJPMDEen9la71DUdf4k=; b=AP35Ghxb/ewPsAoJXKN7G/O8djSkIf09vz7WwJpDI5p3RfDG//RbxWz2bTnGOt1PK3 pISVqDsnTxXfPnIRd9xwEfpzX3N7A8g8/tHxzqd6HchIIUtdKDiJsYSmi7cQOR89qf4C CLlf9srIQbFL24pEdeuEbjn9iiL2wTC7oCajgnvsa4dkdUTSVKVuzVsIxrbmN5kubgiK PMtbi8wK4VVVEx6M7lcQ3/O3OxfLGSGheD1D7qx364A68xZxiVjehS15zpDpUa8+JFtl 7aJYS4A2GZpXFVnBi0fR2L9ocuGw+emqyG7sXMNNEND36lPczRFSI1UGlVC/SJBXeBwK ZijA== X-Gm-Message-State: AOJu0YzCfl4cr68hHOnrYkggJtUaaTAd5fwNLj228TqxLKVC9uqpjWen 
pJGWjW76t80KNyqaxgrS5Hfq6sbwEiSaNIFYlO3NaTx4Sggvvbnq4rTyU/Ay5ZBoQupH/Ovur1S 6En/QdTkOcqliEkfOsjY8ETSqip+/MfV3xVp4gzSGDvmz3mCOvpJRbCdIX87XU5SlrsH9w3S/Nt 8w5OtF X-Received: by 2002:a05:6830:1213:b0:6d9:65f4:3476 with SMTP id r19-20020a056830121300b006d965f43476mr1445000otp.45.1704979686033; Thu, 11 Jan 2024 05:28:06 -0800 (PST) X-Google-Smtp-Source: AGHT+IEUDtsynoyovEo75r/kGY/ceR5MuccGX8KurVoiRcIsaArLKqpzFHcvN5GRSoOqiPwPSZA6RQ== X-Received: by 2002:a05:6830:1213:b0:6d9:65f4:3476 with SMTP id r19-20020a056830121300b006d965f43476mr1444978otp.45.1704979685411; Thu, 11 Jan 2024 05:28:05 -0800 (PST) Received: from [10.0.0.97] ([24.225.234.80]) by smtp.gmail.com with ESMTPSA id i12-20020a05622a08cc00b00429bc01acc5sm422495qte.68.2024.01.11.05.28.04 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 11 Jan 2024 05:28:05 -0800 (PST) Message-ID: <43f07495-f182-aa56-46eb-7ec44667b5c4@redhat.com> Date: Thu, 11 Jan 2024 08:28:04 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.15.1 Subject: Re: [RFC net-next v4] tcp: add support for read with offset when using MSG_PEEK To: passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com References: <20240111131917.28741-1-jmaloy@redhat.com> From: Jon Maloy In-Reply-To: <20240111131917.28741-1-jmaloy@redhat.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Message-ID-Hash: 6TVUUJROEQXW47MRIKWF22VBGXDSJYBZ X-Message-ID-Hash: 6TVUUJROEQXW47MRIKWF22VBGXDSJYBZ X-MailFrom: jmaloy@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for 
passt

On 2024-01-11 08:19, Jon Maloy wrote:
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence letting the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments or flags.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of ~15 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset not supported by kernel.
>
> jmaloy@lubu:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 44822
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
> [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
> [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
> [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
> [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
> [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
> [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
> [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
> [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
> [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
> [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec                  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> jmaloy@lubu:~/passt#
> logout
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> jmaloy@lubu:~/passt$
>
> jmaloy@lubu:~/passt$ perf record -g ./pasta --config-net -f
> MSG_PEEK with offset supported by kernel.
>
> jmaloy@lubu:~/passt# iperf3 -s
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 40854
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 40862
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.22 GBytes  10.5 Gbits/sec
> [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> [  5]   2.00-3.00   sec  1.22 GBytes  10.5 Gbits/sec
> [  5]   3.00-4.00   sec  1.11 GBytes  9.56 Gbits/sec
> [  5]   4.00-5.00   sec  1.20 GBytes  10.3 Gbits/sec
> [  5]   5.00-6.00   sec  1.14 GBytes  9.80 Gbits/sec
> [  5]   6.00-7.00   sec  1.17 GBytes  10.0 Gbits/sec
> [  5]   7.00-8.00   sec  1.12 GBytes  9.61 Gbits/sec
> [  5]   8.00-9.00   sec  1.13 GBytes  9.74 Gbits/sec
> [  5]   9.00-10.00  sec  1.26 GBytes  10.8 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  11.8 GBytes  10.1 Gbits/sec                  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@lubu:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@lubu:~/passt$ perf report -q --symbol-filter=do_syscall_64 -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> 27.24% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
>
> Signed-off-by: Jon Maloy
>
> ---
> v3: Made changes suggested by Stefano Brivio:
>     - Added perf result to commit log
>     - Separated parameter sanity tests from code logic
>
> v4: - Simplified sanity test further.
>       'iov_iter.count' is calculated as the sum of the segment sizes in
>       ___sys_recvmsg()->recvmsg_copy_msghdr()->copy_msghdr_from_user()->__import_iovec()->iov_iter_init()
>       'len' is the same as iov_iter.count, as returned in
>       sock_recvmsg_nosec()->msg_data_left()->iov_iter_count()
>       Hence, iov[0].iov_len cannot be larger than any of those, and no additional
>       testing is necessary.
>     - Improved description of passt/pasta in commit log
>     - Some cosmetic changes to the iperf3/perf output

Hi Stefano,
I think we are very close now. If I can get your ack on this today I will
send it to net-next, and then we'll see...

///jon

> ---
>  net/ipv4/tcp.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ff6838ca2e58..50dc997b82f9 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2353,6 +2353,20 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (flags & MSG_PEEK) {
>  		peek_seq = tp->copied_seq;
>  		seq = &peek_seq;
> +		if (!msg->msg_iter.__iov[0].iov_base) {
> +			size_t peek_offset;
> +
> +			if (msg->msg_iter.nr_segs < 2) {
> +				err = -EINVAL;
> +				goto out;
> +			}
> +			peek_offset = msg->msg_iter.__iov[0].iov_len;
> +			msg->msg_iter.__iov = &msg->msg_iter.__iov[1];
> +			msg->msg_iter.nr_segs -= 1;
> +			msg->msg_iter.count -= peek_offset;
> +			len -= peek_offset;
> +			*seq += peek_offset;
> +		}
>  	}
>
>  	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
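
For readers who do not have the pasta code at hand: the dummy-buffer
approach that the commit message describes as the current workaround can
be sketched in userspace roughly as below. This is only an illustration,
not pasta's actual code. The helper name peek_at_offset() and the 64 KiB
discard buffer are assumptions made up for this example. The first iovec
entry points at a throwaway buffer that soaks up the bytes to be skipped,
which is exactly the extra copy the patch is meant to avoid.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from pasta): peek at up to 'len' bytes located
 * 'offset' bytes into the socket's receive queue, without consuming them.
 * Returns the number of bytes stored in 'buf', or -1 on error.
 */
static ssize_t peek_at_offset(int fd, size_t offset, void *buf, size_t len)
{
	static char discard[65536];	/* dummy buffer, size is an assumption */
	struct iovec iov[2];
	struct msghdr mh;
	ssize_t n;

	assert(buf != NULL);
	if (offset > sizeof(discard))
		return -1;

	/* Entry 0 swallows the bytes we want to skip ... */
	iov[0].iov_base = discard;
	iov[0].iov_len = offset;
	/* ... entry 1 receives the bytes we are actually interested in. */
	iov[1].iov_base = buf;
	iov[1].iov_len = len;

	memset(&mh, 0, sizeof(mh));
	mh.msg_iov = iov;
	mh.msg_iovlen = 2;

	/* MSG_PEEK leaves the data queued in the kernel. */
	n = recvmsg(fd, &mh, MSG_PEEK);
	if (n < 0)
		return -1;

	/* Bytes that landed in the dummy buffer do not count. */
	return (size_t)n > offset ? n - (ssize_t)offset : 0;
}
```

With the change proposed in the quoted patch, a caller on a patched
kernel would instead set iov[0].iov_base to NULL and keep
iov[0].iov_len = offset, so neither the dummy buffer nor the copy into it
is needed. On an unpatched kernel that form is not expected to work,
which is why the sketch above sticks to the existing behaviour.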