From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 658245A026D for ; Tue, 23 Apr 2024 19:50:56 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1713894655; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=klR0grHMwuYxDrLum0hlzlevQk1nknXnahreR6crJmI=; b=LyNYq4s+1VCuNzq6XLeK5V6iIe+SXj5ntyUzAlTLe+QIIf2xws/5N96MBrbKZxasRitncV f6KQgvMuYYKT03lLaDLe+eexaq4O5cMry5GnmX45tihuJDqkcYAwfCTNkCmGHIy8JT+JJu WNkuofxOM7PrFxiqGp22awrxFylT6d0= Received: from mail-wr1-f70.google.com (mail-wr1-f70.google.com [209.85.221.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-135-KXWxjvsmNAWSM-tM71In5Q-1; Tue, 23 Apr 2024 13:50:53 -0400 X-MC-Unique: KXWxjvsmNAWSM-tM71In5Q-1 Received: by mail-wr1-f70.google.com with SMTP id ffacd0b85a97d-34b40e8482aso1567352f8f.0 for ; Tue, 23 Apr 2024 10:50:53 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1713894651; x=1714499451; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=ifJPsE3I7VZx36H24nq5zGNM74a76f5lX29fvyqyauc=; b=eMffmuFyjZTGbvY8jX4y7+8OiFnMkrblcSVCIW7xptCiJDxeKKrbdMBbXtyxUlHLeS 7mSsMfdVcfbMz8CyeZYUVGmQVTnmppcJZI8My6YSE00YfrRWuDy3oef+Nt9OsQk9hUMX gut0hM/W3tRHUDLXHz1th6S8sKccDAv/xKWnGaA+Zo/kfUfnwPAv5ow6yPlNS4L9cFpm 5qOZo4np/iFB1Q+GNXMGIY4sgT6dYIcdoZcJWt7e4irR9InEXHo75vr9Qi0h3Yh/0wOi LljlX5FLO3J1EdyYZX4yeEr2yZZRcfE3jGsIJDpyrpF1CV9DjRC1nCwWksF5WFaMW4bj Q4IA== X-Gm-Message-State: AOJu0YwFAo3EH1owhuzoJtATZObBhbx/agUU4sBA2D9DK9JF6ZA7ATzK PB/I07Z+pm+SYM5o1DcXVr22mTaSor0xWvS3lRXEO1GqXcE4QOKoR/vcYqBDi/4FSnHcTyngwpk quSgUFP2ZpnIutjTQenAs56ExVstC6AWopt1hhaDWW2BhOF3Cp43aU3mj2iQJS58lZj4yBU8J7Q 9t5VoMJ3ewykxWqvMkrToRLbGDM8iAb/fqGBU= X-Received: by 2002:adf:cd90:0:b0:343:78d9:ebad with SMTP id q16-20020adfcd90000000b0034378d9ebadmr12100530wrj.27.1713894650568; Tue, 23 Apr 2024 10:50:50 -0700 (PDT) X-Google-Smtp-Source: AGHT+IE1tjKG8ZUDxr/0U6A3gBlQ5ejbNza8/4icyda1uXHKOTqDrahuMz1CXV0HQlePoQXy4OFcIw== X-Received: by 2002:adf:cd90:0:b0:343:78d9:ebad with SMTP id q16-20020adfcd90000000b0034378d9ebadmr12100496wrj.27.1713894649858; Tue, 23 Apr 2024 10:50:49 -0700 (PDT) Received: from maya.cloud.tilaa.com (maya.cloud.tilaa.com. [164.138.29.33]) by smtp.gmail.com with ESMTPSA id j8-20020a170906830800b00a51a9eccf2asm7277629ejx.125.2024.04.23.10.50.47 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 23 Apr 2024 10:50:48 -0700 (PDT) Date: Tue, 23 Apr 2024 19:50:10 +0200 From: Stefano Brivio To: Jon Maloy Subject: Re: [PATCH 1/2] tcp: leverage support of SO_PEEK_OFF socket option when available Message-ID: <20240423195010.2b4d5c13@elisabeth> In-Reply-To: <20240420191920.104876-2-jmaloy@redhat.com> References: <20240420191920.104876-1-jmaloy@redhat.com> <20240420191920.104876-2-jmaloy@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.36; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Message-ID-Hash: WHPA4KMDYQGNTHJ5UW5LMKWGMEPB5QPC X-Message-ID-Hash: WHPA4KMDYQGNTHJ5UW5LMKWGMEPB5QPC X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Sat, 20 Apr 2024 15:19:19 -0400 Jon Maloy wrote: > The kernel may support recvmsg(MSG_PEEK), starting reading data from a > given offset set by the SO_PEEK_OFF socket option. It would be practical to mention in commit message and in a code comment which kernel commit introduced the feature, that is, your commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option") -- even if it's on net-next, better than nothing (net-next might be rebased but it's quite rare). Also the (forecast, at this point) kernel version where this will be introduced would be nice to have. > This makes it > possible to avoid repeated reading of already read initial bytes of a > received message, hence saving read cycles when forwarding TCP messages > in the host->name space direction. >=20 > In this commit, we add functionality to leverage this feature when availa= ble, > while we fall back to the previous behavior when not. >=20 > Measurements with iperf3 shows that throughput increases with 15-20 perce= nt > in the host->namespace direction when this feature is used. >=20 > Signed-off-by: Jon Maloy > --- > tcp.c | 37 +++++++++++++++++++++++++++++++++++-- > 1 file changed, 35 insertions(+), 2 deletions(-) >=20 > diff --git a/tcp.c b/tcp.c > index 905d26f..95d400a 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -505,6 +505,7 @@ static struct tcp_buf_seq_update tcp6_l2_buf_seq_upda= te[TCP_FRAMES_MEM]; > static unsigned int tcp6_l2_buf_used; > =20 > /* recvmsg()/sendmsg() data for tap */ > +static bool peek_offset_cap =3D false; No need to initialise, it's static. The comment just above referred to tcp_buf_discard and iov_sock ("data"), now you're adding a flag just under it, which is a bit confusing. I would rather leave the original comment for tcp_buf_discard and iov_sock, and add another one, say...: > static char =09=09tcp_buf_discard=09=09[MAX_WINDOW]; > static struct iovec=09iov_sock=09=09[TCP_FRAMES_MEM + 1]; /* Does the kernel support TCP_PEEK_OFF? */ static bool peek_offset_cap; > @@ -582,6 +583,14 @@ static_assert(ARRAY_SIZE(tc_hash) >=3D FLOW_MAX, > int init_sock_pool4=09=09[TCP_SOCK_POOL_SIZE]; > int init_sock_pool6=09=09[TCP_SOCK_POOL_SIZE]; > =20 > +static void set_peek_offset(int s, int offset) A kerneldoc-style function comment would be nice, like all other functions in this file: /** * set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported * @s=09=09Socket * @offset=09Offset in bytes */ > +{ > +=09if (!peek_offset_cap) > +=09=09return; > +=09if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset))) gcc ('make') says: In function =E2=80=98set_peek_offset=E2=80=99, inlined from =E2=80=98tcp_listen_handler=E2=80=99 at tcp.c:2819:2: tcp.c:592:13: warning: =E2=80=98s=E2=80=99 may be used uninitialized [-Wmay= be-uninitialized] 592 | if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(= offset))) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~= ~~~~~~~~ tcp.c: In function =E2=80=98tcp_listen_handler=E2=80=99: tcp.c:2815:13: note: =E2=80=98s=E2=80=99 was declared here 2815 | int s; | ^ clang-tidy ('make clang-tidy'): /home/sbrivio/passt/tcp.c:2819:2: error: 1st function call argument is an u= ninitialized value [clang-analyzer-core.CallAndMessage,-warnings-as-errors] set_peek_offset(s, 0); ^ ~ /home/sbrivio/passt/tcp.c:2815:2: note: 's' declared without an initial val= ue int s; ^~~~~ /home/sbrivio/passt/tcp.c:2817:6: note: Assuming field 'no_tcp' is 0 if (c->no_tcp || !(flow =3D flow_alloc())) ^~~~~~~~~ /home/sbrivio/passt/tcp.c:2817:6: note: Left side of '||' is false /home/sbrivio/passt/tcp.c:2817:21: note: Assuming 'flow' is non-null if (c->no_tcp || !(flow =3D flow_alloc())) ^~~~ /home/sbrivio/passt/tcp.c:2817:2: note: Taking false branch if (c->no_tcp || !(flow =3D flow_alloc())) ^ /home/sbrivio/passt/tcp.c:2819:2: note: 1st function call argument is an un= initialized value set_peek_offset(s, 0); ^ ~ /home/sbrivio/passt/tcp.c:2819:18: error: variable 's' is uninitialized whe= n used here [clang-diagnostic-uninitialized,-warnings-as-errors] set_peek_offset(s, 0); ^ /home/sbrivio/passt/tcp.c:2815:7: note: initialize the variable 's' to sile= nce this warning int s; ^ =3D 0 and cppcheck ('make cppcheck'): tcp.c:2819:18: error: Uninitialized variable: s [uninitvar] set_peek_offset(s, 0); ^ tcp.c:2817:16: note: Assuming condition is false if (c->no_tcp || !(flow =3D flow_alloc())) ^ tcp.c:2819:18: note: Uninitialized variable: s set_peek_offset(s, 0); ^ make: *** [Makefile:296: cppcheck] Error 1 > +=09=09perror("Failed to set SO_PEEK_OFF\n"); > +} > + > /** > * tcp_conn_epoll_events() - epoll events mask for given connection stat= e > * @events:=09Current connection events > @@ -1951,7 +1960,7 @@ static void tcp_conn_from_tap(struct ctx *c, > =09=09if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) > =09=09=09goto cancel; > =09} > - Spurious change. > +=09set_peek_offset(s, 0); Do we really need to initialise it to zero on a new connection? Extra system calls on this path matter for latency of connection establishment. > =09conn =3D &flow->tcp; > =09conn->f.type =3D FLOW_TCP; > =09conn->sock =3D s; > @@ -2174,6 +2183,15 @@ static int tcp_data_from_sock(struct ctx *c, struc= t tcp_tap_conn *conn) > =09if (iov_rem) > =09=09iov_sock[fill_bufs].iov_len =3D iov_rem; > =20 > +=09if (peek_offset_cap) { > +=09=09/* Don't use discard buffer */ > +=09=09mh_sock.msg_iov =3D &iov_sock[1]; > +=09=09mh_sock.msg_iovlen -=3D 1; > + > +=09=09/* Keep kernel sk_peek_off in synch */ While Wiktionary lists "synch" as "Alternative spelling of sync", I would argue that "sync" is much more common and less surprising. > +=09=09set_peek_offset(s, already_sent); Note that there's an early return before this point where already_sent is set to 0. Don't you need to set_peek_offset() also in that case? I guess this would be easier to follow if both assignments of already_sent, earlier in this function, were followed by a set_peek_offset() call. Or do we risk calling it spuriously? > +=09} > + > =09/* Receive into buffers, don't dequeue until acknowledged by guest. *= / > =09do > =09=09len =3D recvmsg(s, &mh_sock, MSG_PEEK); > @@ -2195,7 +2213,9 @@ static int tcp_data_from_sock(struct ctx *c, struct= tcp_tap_conn *conn) > =09=09return 0; > =09} > =20 > -=09sendlen =3D len - already_sent; > +=09sendlen =3D len; > +=09if (!peek_offset_cap) > +=09=09sendlen -=3D already_sent; > =09if (sendlen <=3D 0) { > =09=09conn_flag(c, conn, STALLED); > =09=09return 0; > @@ -2718,6 +2738,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_= ref ref, > =09 tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, > =09=09=09=09 s, (struct sockaddr *)&sa)) > =09=09return; > +=09set_peek_offset(s, 0); Same here, do we really need to initialise it to zero? If yes, it would be nice to leave a blank line after that 'return' as it was before. > =20 > =09tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, > =09=09=09 (struct sockaddr *)&sa, now); > @@ -3042,6 +3063,7 @@ static void tcp_sock_refill_init(const struct ctx *= c) > int tcp_init(struct ctx *c) > { > =09unsigned b; > +=09int s; > =20 > =09for (b =3D 0; b < TCP_HASH_TABLE_SIZE; b++) > =09=09tc_hash[b] =3D FLOW_SIDX_NONE; > @@ -3065,6 +3087,17 @@ int tcp_init(struct ctx *c) > =09=09NS_CALL(tcp_ns_socks_init, c); > =09} > =20 > +=09/* Probe for SO_PEEK_OFF support */ > +=09s =3D socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); By default, we should always pass SOCK_CLOEXEC to socket(), just in case we miss closing one. > +=09if (s < 0) { > +=09=09perror("Temporary tcp socket creation failed\n"); This should be a warn() call, and s/tcp/TCP/. > +=09} else { > +=09=09if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &(int){0}, sizeof(int)= )) { No particular reason to exceed 80 columns here, and curly brackets aren't needed (coding style is the same as kernel). For consistency with the rest of the codebase: &((int) { 0 }). > +=09=09=09peek_offset_cap =3D true; > +=09=09} > +=09=09close(s); > +=09} > +=09printf("SO_PEEK_OFF%ssupported\n", peek_offset_cap ? " " : " not "); This should be a debug() call, a bit too much for info(). > =09return 0; > } > =20 ...I'm still reviewing 2/2, sorry for the delay. --=20 Stefano