Date: Fri, 3 May 2024 15:42:55 +0200
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson, Jon Maloy
Subject: Re: [PATCH v2 1/2] tcp: leverage support of SO_PEEK_OFF socket option when available
Message-ID: <20240503154255.3d062430@elisabeth>
References: <20240501202839.3491885-1-jmaloy@redhat.com> <20240501202839.3491885-2-jmaloy@redhat.com>
Organization: Red Hat
CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
List-Id: Development discussion and patches for passt
On Thu, 2 May 2024 11:31:52 +1000
David Gibson wrote:

> On Wed, May 01, 2024 at 04:28:38PM -0400, Jon Maloy wrote:
> > From linux-6.9.0 the kernel will contain
> > commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").
> >
> > This new feature makes it possible to call recvmsg(MSG_PEEK) and make
> > it start reading data from a given offset set by the SO_PEEK_OFF socket
> > option. This way, we can avoid repeated reading of already read bytes of
> > a received message, hence saving read cycles when forwarding TCP messages
> > in the host->namespace direction.
> >
> > In this commit, we add functionality to leverage this feature when
> > available, while we fall back to the previous behavior when not.
> >
> > Measurements with iperf3 show that throughput increases by 15-20 percent
> > in the host->namespace direction when this feature is used.
> >
> > Signed-off-by: Jon Maloy
> >
> > ---
> > v2: - Some smaller changes as suggested by David Gibson and Stefano Brivio.
> >     - Moved set_peek_offset() to only the locations where the socket is set
> >       to ESTABLISHED.
> >     - Removed the per-packet synchronization between sk_peek_off and
> >       already_sent. Instead only doing it in retransmit situations.
> >     - The problem I found when troubleshooting the occasionally occurring
> >       out-of-sync values between 'already_sent' and 'sk_peek_offset' may
> >       have deeper implications that we may need to investigate.
> > ---
> >  tcp.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 55 insertions(+), 3 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 905d26f..535f1a2 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -513,6 +513,9 @@ static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM];
> >  static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM];
> >  static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM];
> >
> > +/* Does the kernel support TCP_PEEK_OFF? */
> > +static bool peek_offset_cap;
> > +
> >  /* sendmsg() to socket */
> >  static struct iovec tcp_iov [UIO_MAXIOV];
> >
> > @@ -582,6 +585,22 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
> >  int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
> >  int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
> >
> > +/**
> > + * set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported
> > + * @conn	Connection struct with reference to socket and flags
> > + * @offset	Offset in bytes

Sorry, my bad: I forgot colons after the variable names in my proposal
for this comment: @conn:, @offset:.

> > + */
> > +static void set_peek_offset(struct tcp_tap_conn *conn, int offset)
> > +{
> > +	int s;
> > +
> > +	if (!peek_offset_cap)
> > +		return;

Usually we have one extra newline after an early return between a
condition and some other code that's not so closely related.

> > +	s = conn->sock;
> > +	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
> > +		perror("Failed to set SO_PEEK_OFF\n");

Don't print to stderr directly, use err().
> > +}
> > +
> >  /**
> >   * tcp_conn_epoll_events() - epoll events mask for given connection state
> >   * @events:	Current connection events
> > @@ -2174,6 +2193,12 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  	if (iov_rem)
> >  		iov_sock[fill_bufs].iov_len = iov_rem;
> >
> > +	if (peek_offset_cap) {
> > +		/* Don't use discard buffer */
> > +		mh_sock.msg_iov = &iov_sock[1];
> > +		mh_sock.msg_iovlen -= 1;
> > +	}
> > +
>
> I think this would be a little clearer if moved up to where we
> currently initialise mh_sock.msg_iov and iov_sock[0], and make that an
> else clause to this if. That would make it more obvious that we have
> two different - and mutually exclusive - mechanisms for dealing with
> un-acked data in the socket buffer. Not worth a respin on its own,
> though.
>
> >  	/* Receive into buffers, don't dequeue until acknowledged by guest. */
> >  	do
> >  		len = recvmsg(s, &mh_sock, MSG_PEEK);
> > @@ -2195,7 +2220,10 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
> >  		return 0;
> >  	}
> >
> > -	sendlen = len - already_sent;
> > +	sendlen = len;
> > +	if (!peek_offset_cap)
> > +		sendlen -= already_sent;
> > +
> >  	if (sendlen <= 0) {
> >  		conn_flag(c, conn, STALLED);
> >  		return 0;
> >  	}
> > @@ -2365,9 +2393,17 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
> >  		flow_trace(conn,
> >  			   "fast re-transmit, ACK: %u, previous sequence: %u",
> >  			   max_ack_seq, conn->seq_to_tap);
> > +
> > +		/* Ensure seq_from_tap isn't updated twice after call */
> > +		tcp_l2_data_buf_flush(c);
>
> tcp_l2_data_buf_flush() was replaced by tcp_payload_flush() in a
> recently merged change from Laurent.
>
> IIUC, this is necessary because otherwise our update to seq_to_tap can

...but Jon's comment refers to seq_from_tap (not seq_to_tap)? I'm
confused.

> be clobbered from tcp_payload_flush() when we process the
> queued-but-not-sent frames.

...how?
I don't quite understand the issue here: tcp_payload_flush() updates
seq_to_tap once we send the frames, not before, right?

> This seems like a correct fix, but not an
> optimal one: we're flushing out data we've already determined we're
> going to retransmit. Instead, I think we want a different helper that
> simply discards the queued frames

Don't we always send (within the same epoll_wait() cycle) what we
queued? What am I missing?

> - I'm thinking maybe we actually
> want a helper that's called from both the fast and slow retransmit
> paths and handles that.
>
> Ah, wait, we only want to discard queued frames that belong to this
> connection, that's trickier.
>
> It seems to me this is a pre-existing bug, we just managed to get away
> with it previously. I think this is at least one cause of the weirdly
> jumping forwarding sequence numbers you observed. So I think we want
> to make a patch fixing this that goes before the SO_PEEK_OFF changes.
>
> > +
> >  		conn->seq_ack_from_tap = max_ack_seq;
> >  		conn->seq_to_tap = max_ack_seq;
> > +		set_peek_offset(conn, 0);
> >  		tcp_data_from_sock(c, conn);
> > +
> > +		/* Empty queue before any POLL event tries to send it again */
> > +		tcp_l2_data_buf_flush(c);
>
> I'm not clear on what the second flush call is for. The only frames
> queued should be those added by the tcp_data_from_sock() just above,
> and those should be flushed when we get to tcp_defer_handler() before
> we return to the epoll loop.
>
> >  	}
> >
> >  	if (!iov_i)
> > @@ -2459,6 +2495,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
> >  	conn->seq_ack_to_tap = conn->seq_from_tap;
> >
> >  	conn_event(c, conn, ESTABLISHED);
> > +	set_peek_offset(conn, 0);
> >
> >  	/* The client might have sent data already, which we didn't
> >  	 * dequeue waiting for SYN,ACK from tap -- check now.
> > @@ -2539,6 +2576,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
> >  		goto reset;
> >
> >  	conn_event(c, conn, ESTABLISHED);
> > +	set_peek_offset(conn, 0);
>
> The set_peek_offset() could go into conn_event() to avoid the
> duplication. Not sure if it's worth it or not.

I wouldn't do that in conn_event(), the existing side effects there are
kind of expected, but set_peek_offset() isn't so closely related to TCP
events, I'd say.

> >  	if (th->fin) {
> >  		conn->seq_from_tap++;
> > @@ -2705,7 +2743,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> >  	struct sockaddr_storage sa;
> >  	socklen_t sl = sizeof(sa);
> >  	union flow *flow;
> > -	int s;
> > +	int s = 0;
> >
> >  	if (c->no_tcp || !(flow = flow_alloc()))
> >  		return;
> > @@ -2767,7 +2805,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
> >  		flow_dbg(conn, "ACK timeout, retry");
> >  		conn->retrans++;
> >  		conn->seq_to_tap = conn->seq_ack_from_tap;
> > +		set_peek_offset(conn, 0);
> > +		tcp_l2_data_buf_flush(c);
>
> Uh.. doesn't this flush have to go *before* the seq_to_tap update, for
> the reasons discussed above?
>
> >  		tcp_data_from_sock(c, conn);
> > +		tcp_l2_data_buf_flush(c);

I don't understand the purpose of these new tcp_l2_data_buf_flush()
calls. If they fix a pre-existing issue (but I'm not sure which one),
they should be in a different patch.
> >  			tcp_timer_ctl(c, conn);
> >  		}
> >  	} else {
> > @@ -3041,7 +3082,8 @@ static void tcp_sock_refill_init(const struct ctx *c)
> >   */
> >  int tcp_init(struct ctx *c)
> >  {
> > -	unsigned b;
> > +	unsigned int b, optv = 0;
> > +	int s;
> >
> >  	for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
> >  		tc_hash[b] = FLOW_SIDX_NONE;
> > @@ -3065,6 +3107,16 @@ int tcp_init(struct ctx *c)
> >  		NS_CALL(tcp_ns_socks_init, c);
> >  	}
> >
> > +	/* Probe for SO_PEEK_OFF support */
> > +	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
> > +	if (s < 0) {
> > +		warn("Temporary TCP socket creation failed");
> > +	} else {
> > +		if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int)))
> > +			peek_offset_cap = true;
> > +		close(s);
> > +	}
> > +	debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
> >  	return 0;
> >  }

-- 
Stefano