public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Eric Dumazet <edumazet@google.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com,
	lvivier@redhat.com, dgibson@redhat.com, jmaloy@redhat.com,
	netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: [PATCH v3] tcp: add support for SO_PEEK_OFF
Date: Tue, 13 Feb 2024 13:24:09 +0100	[thread overview]
Message-ID: <CANn89iJW=nEzVjqxzPht20dUnfqxWGXMO2_EpKUV4JHawBRxfw@mail.gmail.com> (raw)
In-Reply-To: <8d77d8a4e6a37e80aa46cd8df98de84714c384a5.camel@redhat.com>

On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> Oops,
>
> I just noticed Eric is missing from the recipients list, adding him
> now.
>

Hmmm thanks.

> On Fri, 2024-02-09 at 17:12 -0500, jmaloy@redhat.com wrote:
> > From: Jon Maloy <jmaloy@redhat.com>
> >
> > When reading received messages from a socket with MSG_PEEK, we may want
> > to read the contents with an offset, like we can do with pread/preadv()
> > when reading files. Currently, it is not possible to do that.
> >
> > In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> > in a similar way it is done for Unix Domain sockets.
> >
> > In the iperf3 log examples shown below, we can observe a throughput
> > improvement of 15-20 % in the direction host->namespace when using the
> > protocol splicer 'pasta' (https://passt.top).
> > This is a consistent result.
> >
> > pasta(1) and passt(1) implement user-mode networking for network
> > namespaces (containers) and virtual machines by means of a translation
> > layer between Layer-2 network interface and native Layer-4 sockets
> > (TCP, UDP, ICMP/ICMPv6 echo).
> >
> > Received, pending TCP data to the container/guest is kept in kernel
> > buffers until acknowledged, so the tool routinely needs to fetch new
> > data from socket, skipping data that was already sent.
> >
> > At the moment this is implemented using a dummy buffer passed to
> > recvmsg(). With this change, we don't need a dummy buffer and the
> > related buffer copy (copy_to_user()) anymore.
> >
> > passt and pasta are supported in KubeVirt and libvirt/qemu.
> >
> > jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> > SO_PEEK_OFF not supported by kernel.
> >
> > jmaloy@freyr:~/passt# iperf3 -s
> > -----------------------------------------------------------
> > Server listening on 5201 (test #1)
> > -----------------------------------------------------------
> > Accepted connection from 192.168.122.1, port 44822
> > [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
> > [  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
> > [  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
> > [  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
> > [  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
> > [  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
> > [  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
> > [  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
> > [  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
> > [  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
> > [  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec   receiver
> > -----------------------------------------------------------
> > Server listening on 5201 (test #2)
> > -----------------------------------------------------------
> > ^Ciperf3: interrupt - the server has terminated
> > jmaloy@freyr:~/passt#
> > logout
> > [ perf record: Woken up 23 times to write data ]
> > [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
> > jmaloy@freyr:~/passt$
> >
> > jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
> > SO_PEEK_OFF supported by kernel.
> >
> > jmaloy@freyr:~/passt# iperf3 -s
> > -----------------------------------------------------------
> > Server listening on 5201 (test #1)
> > -----------------------------------------------------------
> > Accepted connection from 192.168.122.1, port 52084
> > [  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
> > [  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
> > [  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
> > [  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
> > [  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
> > [  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
> > [  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
> > [  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
> > [  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
> > [  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
> > [  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec  receiver
> > -----------------------------------------------------------
> > Server listening on 5201 (test #2)
> > -----------------------------------------------------------
> > ^Ciperf3: interrupt - the server has terminated
> > logout
> > [ perf record: Woken up 20 times to write data ]
> > [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
> > jmaloy@freyr:~/passt$
> >
> > The perf record confirms this result. Below, we can observe that the
> > CPU spends significantly less time in the function ____sys_recvmsg()
> > when we have offset support.
> >
> > Without offset support:
> > ----------------------
> > jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
> >                        -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
> > 46.32%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
> >
> > With offset support:
> > ----------------------
> > jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
> >                        -p ____sys_recvmsg -x --stdio -i  perf.data | head -1
> > 28.12%     0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg
> >
> > Suggested-by: Paolo Abeni <pabeni@redhat.com>
> > Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> >
> > ---
> > v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
> > ---
> >  net/ipv4/af_inet.c |  1 +
> >  net/ipv4/tcp.c     | 16 ++++++++++------
> >  2 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index 4e635dd3d3c8..5f0e5d10c416 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1071,6 +1071,7 @@ const struct proto_ops inet_stream_ops = {
> >  #endif
> >       .splice_eof        = inet_splice_eof,
> >       .splice_read       = tcp_splice_read,
> > +     .set_peek_off      = sk_set_peek_off,
> >       .read_sock         = tcp_read_sock,
> >       .read_skb          = tcp_read_skb,
> >       .sendmsg_locked    = tcp_sendmsg_locked,
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 7e2481b9eae1..1c8cab14a32c 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1415,8 +1415,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
> >       struct sk_buff *skb;
> >       int copied = 0, err = 0;
> >
> > -     /* XXX -- need to support SO_PEEK_OFF */
> > -
> >       skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> >               err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
> >               if (err)
> > @@ -2327,6 +2325,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >       int target;             /* Read at least this many bytes */
> >       long timeo;
> >       struct sk_buff *skb, *last;
> > +     u32 peek_offset = 0;
> >       u32 urg_hole = 0;
> >
> >       err = -ENOTCONN;
> > @@ -2360,7 +2359,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >
> >       seq = &tp->copied_seq;
> >       if (flags & MSG_PEEK) {
> > -             peek_seq = tp->copied_seq;
> > +             peek_offset = max(sk_peek_offset(sk, flags), 0);
> > +             peek_seq = tp->copied_seq + peek_offset;
> >               seq = &peek_seq;
> >       }
> >
> > @@ -2463,11 +2463,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >               }
> >
> >               if ((flags & MSG_PEEK) &&
> > -                 (peek_seq - copied - urg_hole != tp->copied_seq)) {
> > +                 (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
> >                       net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
> >                                           current->comm,
> >                                           task_pid_nr(current));
> > -                     peek_seq = tp->copied_seq;
> > +                     peek_seq = tp->copied_seq + peek_offset;
> >               }
> >               continue;
> >
> > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >               WRITE_ONCE(*seq, *seq + used);
> >               copied += used;
> >               len -= used;
> > -
> > +             if (flags & MSG_PEEK)
> > +                     sk_peek_offset_fwd(sk, used);
> > +             else
> > +                     sk_peek_offset_bwd(sk, used);

Yet another cache miss in TCP fast path...

We need to move sk_peek_off in a better location before we accept this patch.

I always thought MSK_PEEK was very inefficient, I am surprised we
allow arbitrary loops in recvmsg().

  reply	other threads:[~2024-02-13 12:24 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-09 22:12 [PATCH v3] tcp: add support for SO_PEEK_OFF jmaloy
2024-02-11 23:17 ` Stefano Brivio
2024-02-13 10:49 ` Paolo Abeni
2024-02-13 12:24   ` Eric Dumazet [this message]
2024-02-13 13:02     ` Paolo Abeni
2024-02-13 13:34       ` Eric Dumazet
2024-02-13 15:28         ` Paolo Abeni
2024-02-13 15:49           ` Eric Dumazet
2024-02-13 18:39             ` Paolo Abeni
2024-02-13 19:31               ` Eric Dumazet
     [not found]                 ` <20687849-ec5c-9ce5-0a18-cc80f5b64816@redhat.com>
2024-02-15 17:41                   ` Paolo Abeni
2024-02-15 17:46                     ` Eric Dumazet
2024-02-15 22:24                       ` Jon Maloy
2024-02-16  9:14                         ` Paolo Abeni
2024-02-16  9:21                           ` Eric Dumazet
     [not found]                             ` <6a9f5dec-eb0c-51ef-0911-7345f50e08f0@redhat.com>
2024-02-16 10:55                               ` Eric Dumazet
2024-02-19  2:02                               ` David Gibson
2024-02-13 23:34             ` David Gibson
2024-02-14  3:41               ` Eric Dumazet
2024-02-15  3:16                 ` David Gibson
2024-02-15  3:21               ` David Gibson
2024-02-15  9:15                 ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CANn89iJW=nEzVjqxzPht20dUnfqxWGXMO2_EpKUV4JHawBRxfw@mail.gmail.com' \
    --to=edumazet@google.com \
    --cc=davem@davemloft.net \
    --cc=dgibson@redhat.com \
    --cc=jmaloy@redhat.com \
    --cc=kuba@kernel.org \
    --cc=lvivier@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).