From: Eric Dumazet
Date: Tue, 13 Feb 2024 13:24:09 +0100
Subject: Re: [PATCH v3] tcp: add support for SO_PEEK_OFF
To: Paolo Abeni
CC: kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com, jmaloy@redhat.com, netdev@vger.kernel.org, davem@davemloft.net
References: <20240209221233.3150253-1-jmaloy@redhat.com> <8d77d8a4e6a37e80aa46cd8df98de84714c384a5.camel@redhat.com>
In-Reply-To: <8d77d8a4e6a37e80aa46cd8df98de84714c384a5.camel@redhat.com>
List-Id: Development discussion and patches for passt

On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni wrote:
>
> Oops,
>
> I just noticed Eric is missing from the recipients list, adding him
> now.
>

Hmmm thanks.

> On Fri, 2024-02-09 at 17:12 -0500, jmaloy@redhat.com wrote:
> > From: Jon Maloy
> >
> > When reading received messages from a socket with MSG_PEEK, we may want
> > to read the contents with an offset, like we can do with pread/preadv()
> > when reading files. Currently, it is not possible to do that.
> >
> > In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> > in a similar way it is done for Unix Domain sockets.
> >
> > In the iperf3 log examples shown below, we can observe a throughput
> > improvement of 15-20 % in the direction host->namespace when using the
> > protocol splicer 'pasta' (https://passt.top).
> > This is a consistent result.
> >
> > pasta(1) and passt(1) implement user-mode networking for network
> > namespaces (containers) and virtual machines by means of a translation
> > layer between Layer-2 network interface and native Layer-4 sockets
> > (TCP, UDP, ICMP/ICMPv6 echo).
> >
> > Received, pending TCP data to the container/guest is kept in kernel
> > buffers until acknowledged, so the tool routinely needs to fetch new
> > data from socket, skipping data that was already sent.
> >
> > At the moment this is implemented using a dummy buffer passed to
> > recvmsg(). With this change, we don't need a dummy buffer and the
> > related buffer copy (copy_to_user()) anymore.
> >
> > passt and pasta are supported in KubeVirt and libvirt/qemu.
> >
> > jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> > SO_PEEK_OFF not supported by kernel.
> >
> > jmaloy@freyr:~/passt# iperf3 -s
> > -----------------------------------------------------------
> > Server listening on 5201 (test #1)
> > -----------------------------------------------------------
> > Accepted connection from 192.168.122.1, port 44822
> > [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
> > [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
> > [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
> > [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
> > [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
> > [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
> > [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
> > [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
> > [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
> > [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
> > [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec                  receiver
> > -----------------------------------------------------------
> > Server listening on 5201 (test #2)
> > -----------------------------------------------------------
> > ^Ciperf3: interrupt - the server has terminated
> > jmaloy@freyr:~/passt#
> > logout
> > [ perf record: Woken up 23 times to write data ]
> > [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> > jmaloy@freyr:~/passt$
> >
> > jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> > SO_PEEK_OFF supported by kernel.
> >
> > jmaloy@freyr:~/passt# iperf3 -s
> > -----------------------------------------------------------
> > Server listening on 5201 (test #1)
> > -----------------------------------------------------------
> > Accepted connection from 192.168.122.1, port 52084
> > [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
> > [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> > [  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
> > [  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
> > [  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
> > [  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
> > [  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
> > [  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
> > [  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
> > [  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
> > [  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec                  receiver
> > -----------------------------------------------------------
> > Server listening on 5201 (test #2)
> > -----------------------------------------------------------
> > ^Ciperf3: interrupt - the server has terminated
> > logout
> > [ perf record: Woken up 20 times to write data ]
> > [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> > jmaloy@freyr:~/passt$
> >
> > The perf record confirms this result. Below, we can observe that the
> > CPU spends significantly less time in the function ____sys_recvmsg()
> > when we have offset support.
> >
> > Without offset support:
> > ----------------------
> > jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
> >                        -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> >    46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
> >
> > With offset support:
> > ----------------------
> > jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
> >                        -p ____sys_recvmsg -x --stdio -i perf.data | head -1
> >    28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
> >
> > Suggested-by: Paolo Abeni
> > Signed-off-by: Jon Maloy
> >
> > ---
> > v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
> > ---
> >  net/ipv4/af_inet.c |  1 +
> >  net/ipv4/tcp.c     | 16 ++++++++++------
> >  2 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 4e635dd3d3c8..5f0e5d10c416 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1071,6 +1071,7 @@ const struct proto_ops inet_stream_ops = {
> >  #endif
> >  	.splice_eof = inet_splice_eof,
> >  	.splice_read = tcp_splice_read,
> > +	.set_peek_off = sk_set_peek_off,
> >  	.read_sock = tcp_read_sock,
> >  	.read_skb = tcp_read_skb,
> >  	.sendmsg_locked = tcp_sendmsg_locked,
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 7e2481b9eae1..1c8cab14a32c 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1415,8 +1415,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
> >  	struct sk_buff *skb;
> >  	int copied = 0, err = 0;
> >
> > -	/* XXX -- need to support SO_PEEK_OFF */
> > -
> >  	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> >  		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
> >  		if (err)
> > @@ -2327,6 +2325,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >  	int target; /* Read at least this many bytes */
> >  	long timeo;
> >  	struct sk_buff *skb, *last;
> > +	u32 peek_offset = 0;
> >  	u32 urg_hole = 0;
> >
> >  	err = -ENOTCONN;
> > @@ -2360,7 +2359,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >
> >  	seq = &tp->copied_seq;
> >  	if (flags & MSG_PEEK) {
> > -		peek_seq = tp->copied_seq;
> > +		peek_offset = max(sk_peek_offset(sk, flags), 0);
> > +		peek_seq = tp->copied_seq + peek_offset;
> >  		seq = &peek_seq;
> >  	}
> >
> > @@ -2463,11 +2463,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >  		}
> >
> >  		if ((flags & MSG_PEEK) &&
> > -		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
> > +		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
> >  			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
> >  					    current->comm,
> >  					    task_pid_nr(current));
> > -			peek_seq = tp->copied_seq;
> > +			peek_seq = tp->copied_seq + peek_offset;
> >  		}
> >  		continue;
> >
> > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >  		WRITE_ONCE(*seq, *seq + used);
> >  		copied += used;
> >  		len -= used;
> > -
> > +		if (flags & MSG_PEEK)
> > +			sk_peek_offset_fwd(sk, used);
> > +		else
> > +			sk_peek_offset_bwd(sk, used);

Yet another cache miss in TCP fast path...

We need to move sk_peek_off to a better location before we accept this patch.

I always thought MSG_PEEK was very inefficient, I am surprised we allow
arbitrary loops in recvmsg().