From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 22 Jun 2023 22:36:49 -0400
Subject: Re: [RFC v2] tcp: add support for read with offset when using MSG_PEEK
To: passt-dev@passt.top, Stefano Brivio, David Gibson, Laurent Vivier
References: <20230623021227.2625490-1-jmaloy@redhat.com>
From: Jon Maloy
In-Reply-To: <20230623021227.2625490-1-jmaloy@redhat.com>
List-Id: Development discussion and patches for passt

(Added to distribution list.)

I found that my first approach was too simplistic: it only moved the
read position forward in the receive buffer, but continued to fill in
iov[0] with the indicated length. This commit does exactly what we want:
we pass a NULL pointer in iov[0], the bytes actually read end up in the
remaining entries, and the return value indicates the length actually
read.

I look forward to feedback on this; then I can hopefully post it to the
netdev list next week.

///jon

On 2023-06-22 22:12, Jon Maloy wrote:
> When reading received messages with MSG_PEEK, we sometimes have to read
> the leading bytes of the stream several times, only to reach the bytes
> we really want. This is clearly non-optimal.
>
> What we would want is something similar to pread/preadv(), but working
> even for tcp sockets. At the same time, we obviously don't want to add
> any new arguments to the recv/recvmsg() calls.
>
> In this commit, we allow the user to set iovec.iov_base in the first
> vector entry to NULL. This tells the socket to skip the first entry,
> hence making the iov_len field of that entry indicate the offset value.
> This way, there is no need to add any new arguments.
>
> This change is simple and non-intrusive, and should be a safe addition to
> the socket API. We have measured it to give a throughput improvement of
> 8-10 % for the protocol splicer 'passt', which is used in KubeVirt
> containers.
>
> Signed-off-by: Jon Maloy
>
> works with original msghdr
>
> Signed-off-by: Jon Maloy
> ---
>  net/ipv4/tcp.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 33f559f491c8..1d89337e89b6 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2428,6 +2428,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	int copied = 0;
>  	u32 peek_seq;
> +	u32 peek_offset;
>  	u32 *seq;
>  	unsigned long used;
>  	int err;
> @@ -2435,7 +2436,6 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	long timeo;
>  	struct sk_buff *skb, *last;
>  	u32 urg_hole = 0;
> -
>  	err = -ENOTCONN;
>  	if (sk->sk_state == TCP_LISTEN)
>  		goto out;
> @@ -2469,6 +2469,14 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (flags & MSG_PEEK) {
>  		peek_seq = tp->copied_seq;
>  		seq = &peek_seq;
> +		if (msg->msg_iter.iov[0].iov_base == NULL) {
> +			peek_offset = msg->msg_iter.iov[0].iov_len;
> +			msg->msg_iter.iov = &msg->msg_iter.iov[1];
> +			msg->msg_iter.nr_segs -= 1;
> +			msg->msg_iter.count -= peek_offset;
> +			len -= peek_offset;
> +			*seq += peek_offset;
> +		}
>  	}
>
>  	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
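
For readers who want to see the intended calling convention from the
user's side, here is a minimal userspace sketch of the semantics the
commit message describes. It does not touch the kernel: the helper
peek_with_offset() and the flat 'stream' buffer are hypothetical
stand-ins for recvmsg(..., MSG_PEEK) and the socket's receive queue.
What happens when the offset exceeds the queued data is not stated in
this thread; returning 0 in that case is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Userspace model of the proposed MSG_PEEK behaviour (hypothetical helper,
 * not kernel code). 'stream' stands in for the unread bytes queued on the
 * socket. If iov[0].iov_base is NULL, iov[0].iov_len is taken as an offset
 * to skip, and data is copied only into iov[1..]. Nothing is consumed.
 * Returns the number of bytes copied into the real buffers.
 */
static ssize_t peek_with_offset(const char *stream, size_t stream_len,
				const struct iovec *iov, int iovcnt)
{
	size_t pos = 0;		/* read position within 'stream' */
	size_t copied = 0;
	int i = 0;

	assert(stream != NULL && iov != NULL && iovcnt > 0);

	if (iov[0].iov_base == NULL) {
		pos = iov[0].iov_len;	/* entry carries an offset, no buffer */
		i = 1;
	}

	/* Assumption: an offset past the queued data yields 0 bytes. */
	for (; i < iovcnt && pos < stream_len; i++) {
		size_t n = stream_len - pos;

		if (n > iov[i].iov_len)
			n = iov[i].iov_len;
		if (n == 0)
			continue;
		memcpy(iov[i].iov_base, stream + pos, n);
		pos += n;
		copied += n;
	}

	return (ssize_t)copied;
}
```

A caller that has already handled the first 6 bytes of the stream would
pass { NULL, 6 } as iov[0], followed by its real buffers. In this model
the return value counts only the bytes placed in those buffers, which
matches the "actually read length" mentioned above.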