From: Eric Dumazet
Date: Mon, 8 Apr 2024 11:46:34 +0200
Subject: Re: [net-next 1/2] tcp: add support for SO_PEEK_OFF socket option
To: jmaloy@redhat.com
CC: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
 passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com,
 dgibson@redhat.com, eric.dumazet@gmail.com
References: <20240406182107.261472-1-jmaloy@redhat.com>
 <20240406182107.261472-2-jmaloy@redhat.com>
In-Reply-To: <20240406182107.261472-2-jmaloy@redhat.com>
List-Id: Development discussion and patches for passt

On Sat, Apr 6, 2024 at 8:21 PM wrote:
>
> From: Jon Maloy
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> In the iperf3 log examples shown below, we can observe a throughput
> improvement of 15-20 % in the direction host->namespace when using the
> protocol splicer 'pasta' (https://passt.top).
> This is a consistent result.
>
> pasta(1) and passt(1) implement user-mode networking for network
> namespaces (containers) and virtual machines by means of a translation
> layer between Layer-2 network interface and native Layer-4 sockets
> (TCP, UDP, ICMP/ICMPv6 echo).
>
> Received, pending TCP data to the container/guest is kept in kernel
> buffers until acknowledged, so the tool routinely needs to fetch new
> data from socket, skipping data that was already sent.
>
> At the moment this is implemented using a dummy buffer passed to
> recvmsg(). With this change, we don't need a dummy buffer and the
> related buffer copy (copy_to_user()) anymore.
>
> passt and pasta are supported in KubeVirt and libvirt/qemu.
>
> j
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 192.168.122.1, port 52084
> [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
> [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> [  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
> [  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
> [  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
> [  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
> [  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
> [  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
> [  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
> [  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
> [  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec                  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
> ^Ciperf3: interrupt - the server has terminated
> logout
> [ perf record: Woken up 20 times to write data ]
> [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> jmaloy@freyr:~/passt$
>
> The perf record confirms this result. Below, we can observe that the
> CPU spends significantly less time in the function ____sys_recvmsg()
> when we have offset support.
>
> Without offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
>                       -p ____sys_recvmsg -x --stdio -i perf.data | head -1
>     46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
>
> With offset support:
> ----------------------
> jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
>                       -p ____sys_recvmsg -x --stdio -i perf.data | head -1
>     28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
>
> Suggested-by: Paolo Abeni
> Reviewed-by: Stefano Brivio
> Signed-off-by: Jon Maloy
>
> ---
> v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
> v4: - Same as v3. Posting was delayed because I first had to debug
>       an issue that turned out to not be directly related to this
>       change. See next commit in this series.

This other issue is orthogonal, and might take more time.

SO_RCVLOWAT had a similar issue, please take a look at what we did there.

If you need SO_PEEK_OFF support, I would suggest you submit this patch
as a standalone one.

Reviewed-by: Eric Dumazet

Thanks.