From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 3AD3C5A027E for ; Thu, 15 Feb 2024 04:21:44 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1707967298; bh=gsHdd/rcV85tibN/5M8pMhwJYE+uz0ezc8HK7f8OLgA=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=lehy0A4LXEkEon7jc8JG1icPCNZjeGT3rnyNYXV+ipdrl6zAPHelO2mQUiKDhDFwY eSiSas7g9JsDaDD/Q9SSo4OCu8AXE6IcGrenq8BodJoFeQAL6TO4BUWVV0ySkRGbyy HNEo72k9Ly5g/pHdjqlYYjxtYDMjiuwp4lP9Bhr2lV8Zp2qIVWwsVoEYE7JsMu9fX2 TTdxCS4KGgijB2VuZ2gteeOU+wsY3dtm9EZ4pwnBUx8LDVKwEoqXPS5yaVMPwBy51b 95uunyWBaJRQwWWaHNKG6q4vwV5uXZyZI/3aZgeiVN1p7qWvfQNBd34B25xmqF6TMd ErnamKTiNuCGA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4Tb0hy4KJ3z4wcN; Thu, 15 Feb 2024 14:21:38 +1100 (AEDT) Date: Thu, 15 Feb 2024 14:21:34 +1100 From: David Gibson To: Eric Dumazet Subject: Re: [PATCH v3] tcp: add support for SO_PEEK_OFF Message-ID: References: <20240209221233.3150253-1-jmaloy@redhat.com> <8d77d8a4e6a37e80aa46cd8df98de84714c384a5.camel@redhat.com> <20072ba530b34729589a3d527c420a766b49e205.camel@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="iyh3BKSr+QL7iMHm" Content-Disposition: inline In-Reply-To: Message-ID-Hash: KEI4U2MNILSGXFSFRUTT2PLJONAGRXCK X-Message-ID-Hash: KEI4U2MNILSGXFSFRUTT2PLJONAGRXCK X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Paolo Abeni , kuba@kernel.org, passt-dev@passt.top, sbrivio@redhat.com, lvivier@redhat.com, dgibson@redhat.com, jmaloy@redhat.com, netdev@vger.kernel.org, davem@davemloft.net X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --iyh3BKSr+QL7iMHm Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Wed, Feb 14, 2024 at 10:34:50AM +1100, David Gibson wrote: > On Tue, Feb 13, 2024 at 04:49:01PM +0100, Eric Dumazet wrote: > > On Tue, Feb 13, 2024 at 4:28=E2=80=AFPM Paolo Abeni = wrote: > > > > > > On Tue, 2024-02-13 at 14:34 +0100, Eric Dumazet wrote: > > > > On Tue, Feb 13, 2024 at 2:02=E2=80=AFPM Paolo Abeni wrote: > > > > > > > > > > On Tue, 2024-02-13 at 13:24 +0100, Eric Dumazet wrote: > > > > > > On Tue, Feb 13, 2024 at 11:49=E2=80=AFAM Paolo Abeni wrote: > > > > > > > > > > > > > > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct= sock *sk, struct msghdr *msg, size_t len, > > > > > > > > WRITE_ONCE(*seq, *seq + used); > > > > > > > > copied +=3D used; > > > > > > > > len -=3D used; > > > > > > > > - > > > > > > > > + if (flags & MSG_PEEK) > > > > > > > > + sk_peek_offset_fwd(sk, used); > > > > > > > > + else > > > > > > > > + sk_peek_offset_bwd(sk, used); > > > > > > > > > > > > Yet another cache miss in TCP fast path... > > > > > > > > > > > > We need to move sk_peek_off in a better location before we acce= pt this patch. > > > > > > > > > > > > I always thought MSK_PEEK was very inefficient, I am surprised = we > > > > > > allow arbitrary loops in recvmsg(). > > > > > > > > > > Let me double check I read the above correctly: are you concerned= by > > > > > the 'skb_queue_walk(&sk->sk_receive_queue, skb) {' loop that could > > > > > touch a lot of skbs/cachelines before reaching the relevant skb? > > > > > > > > > > The end goal here is allowing an user-space application to read > > > > > incrementally/sequentially the received data while leaving them in > > > > > receive buffer. > > > > > > > > > > I don't see a better option than MSG_PEEK, am I missing something? > > > > > > > > > > > > This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the = non > > > > MSG_PEEK case is very strange IMO. > > > > > > > > Ideally, we should read/write over sk_peek_offset only when MSG_PEEK > > > > is used by the caller. > > > > > > > > That would only touch non fast paths. > > > > > > > > Since the API is mono-threaded anyway, the caller should not rely on > > > > the fact that normal recvmsg() call > > > > would 'consume' sk_peek_offset. > > > > > > Storing in sk_peek_seq the tcp next sequence number to be peeked shou= ld > > > avoid changes in the non MSG_PEEK cases. > > > > > > AFAICS that would need a new get_peek_off() sock_op and a bit somewhe= re > > > (in sk_flags?) to discriminate when sk_peek_seq is actually set. Would > > > that be acceptable? > >=20 > > We could have a parallel SO_PEEK_OFFSET option, reusing the same socket= field. > >=20 > > The new semantic would be : Supported by TCP (so far), and tcp > > recvmsg() only reads/writes this field when MSG_PEEK is used. > > Applications would have to clear the values themselves. >=20 > Those semantics would likely defeat the purpose of using SO_PEEK_OFF > for our use case, since we'd need an additional setsockopt() for every > non-PEEK recv() (which are all MSG_TRUNC in our case). Btw, Eric, If you're concerned about the extra access added to the "regular" TCP path, would you be happier with the original approach Jon proposed: that allowed a user to essentially supply an offset to an individial MSG_PEEK recvmsg() by inserting a dummy entry as msg_iov[0] with a NULL pointer and length to skip. It did the job for us, although I admit it's a little ugly, which I presume is why Paolo suggested we investigate SO_PEEK_OFF instead. I think the SO_PEEK_OFF approach is more elegant, but maybe the performance impact outweighs that. --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --iyh3BKSr+QL7iMHm Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmXNgz0ACgkQzQJF27ox 2Ge9HxAAin8tsw9AkMt3hy/JhIPU4kQASeafgOA7dXeRnG9VnGg6xjFS+Ym92OJb rO6sa2hpvxY1zM4mGRfdhqvw7WJUbkXLO6yhGPi1DSipN/7k1N1UWcN6c/R8b+4c z6u3R2kXZqkmCn9HGibGK1Gk2L3WkkLnKa4JwgRj9mygzsoRMOsrIBRnzWFPEEM6 HjcLzQd1k/puZ7XKHrzOPRJS3Hy2pVnPY2trSeUrH7w0lP1niWEaNOlmDDw84Ts5 BrqStIC1dBBt38zBzkeGnivUgAd4Unlyf68fBZh2pMBmRmi3bDBRZaWKYmCEgW/o nZTd6xgxp3omM45wKH0Tfrxq54O0w0wSsz7tRY9dGCo8spcQYuqegCl8HjscC4xj g1NRK/iY5BrBgHGfiiBS2AZXFDyqFg4YU0d27rSNA1Rdc0Ftnw2asImS9OBH8rcQ 9c8LEhd295YfgQWG6i1/8/TGq0xdDw7DmY5Pe8FUSjRduv9P3O6n2lrGWZvJycig ABbzPJeNkBaFp0iHuhBp2CkcHgMahk12goz84M/zx6eY/cg5sexr4JjdQokUNgFv bpsI/FyPgMypjYGWCF3AtkOwtu/5iSu08kBD/EzdJRF55fDpn41dkAaYnP/18TrT uct3QQwc6+1CQeVI3UfQINmWhYHr7OfE751rukIyLNL+pyHXvSY= =RbtH -----END PGP SIGNATURE----- --iyh3BKSr+QL7iMHm--