From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 784DD5A02E0 for ; Tue, 14 May 2024 19:23:09 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1715707388; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=sK1nd87OVLRDZoigNtAQFwl/Aaq9wl9DN7LB7ebUSvA=; b=ddsLWzgBVpqdpmJjASuAc9OnGKGqbgI/mk0IzG2QqE1ioSnHP+J3L/HTLD8Q+4ay6BcO7N 48KAu3Xc8qH741vnFpKU7pVpY7jNmjkXJW0s8NZH5/3Ffuhk0tntnkWe179oeFINJQwuba ZxRW5cS7/69J1vO6ywz+dsrBsZgRrfI= Received: from mail-lf1-f69.google.com (mail-lf1-f69.google.com [209.85.167.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-691-v3vIqJ_YPCKPLVf9CFikdA-1; Tue, 14 May 2024 13:22:55 -0400 X-MC-Unique: v3vIqJ_YPCKPLVf9CFikdA-1 Received: by mail-lf1-f69.google.com with SMTP id 2adb3069b0e04-52389b09bb6so141503e87.1 for ; Tue, 14 May 2024 10:22:55 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1715707373; x=1716312173; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=sK1nd87OVLRDZoigNtAQFwl/Aaq9wl9DN7LB7ebUSvA=; b=nHafFtCScek1eJo6JdiDtFDOQdUqzUrXQyL6D6xDAin85rAMQSTpJy26aXxOBVaPHo N0yn8BHw2seTRhfUXYMLkzlM1H38PPVD9CdhQma8TYGiSW30DREH1Z1MvYs2ijRCo+F4 25CZf2m0IXyImsmWm1NzEc/TMNQCVnZCn9KvHPd9yRHHXFKIj+MHck3etxxdtvletDyK vKfd1OZB8LAHR8TBJTq69PMlnSPu8wMEbUuVRhYfPa1EiMlKrtpE25/FhKbxC4TcJBBX uluqii3ue6ybRNCUDPUYPeIoj+1zZ2iyco9KwSF+8jj3YK5g4NfV4oHcqlNqUTXOr65V R6sw== X-Gm-Message-State: AOJu0Yxt07LFdkcvxDQ/F+r6plppo1s+Soy7Xc2BSftev5c6evJWC6kh PwjKVqjE4Da7Qbb5s3/DYGvq+ARPoXLpurkCUHp28qwkj5qjg1NPzSiBV0Q4JhDG63foKc+92zK /9jnQoSpxUrv331FITNUEYFDu4Z3fl/5xhbqDeY+5pKzqntjXzL8TWDXiLnwKKVT5W6RvPVMES0 a7pdPPGQi9CN2L44raqUj9JmYdQtwnRLdp4M0= X-Received: by 2002:ac2:5df3:0:b0:516:d14a:9692 with SMTP id 2adb3069b0e04-5220fc73612mr8214303e87.6.1715707373197; Tue, 14 May 2024 10:22:53 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGfMs5q0guOJ+dskCaYzJf9JyGlgL17aXy6n+NGrU7QqwV2iUbDZEYMWoNyyJ+At+TgE27JJw== X-Received: by 2002:ac2:5df3:0:b0:516:d14a:9692 with SMTP id 2adb3069b0e04-5220fc73612mr8214283e87.6.1715707372428; Tue, 14 May 2024 10:22:52 -0700 (PDT) Received: from maya.cloud.tilaa.com (maya.cloud.tilaa.com. [164.138.29.33]) by smtp.gmail.com with ESMTPSA id 4fb4d7f45d1cf-574eaace761sm632111a12.91.2024.05.14.10.22.51 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 14 May 2024 10:22:51 -0700 (PDT) Date: Tue, 14 May 2024 19:22:18 +0200 From: Stefano Brivio To: Jon Maloy Subject: Re: [PATCH v3 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available Message-ID: <20240514192218.32382aae@elisabeth> In-Reply-To: <20240511152008.421750-3-jmaloy@redhat.com> References: <20240511152008.421750-1-jmaloy@redhat.com> <20240511152008.421750-3-jmaloy@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.36; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: MSS5CG5WD7IPIGUCEPG435G4UFFDKEUP X-Message-ID-Hash: MSS5CG5WD7IPIGUCEPG435G4UFFDKEUP X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Sat, 11 May 2024 11:20:07 -0400 Jon Maloy wrote: > From linux-6.9.0 the kernel will contain > commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option"). > > This new feature makes is possible to call recv_msg(MSG_PEEK) and make > it start reading data from a given offset set by the SO_PEEK_OFF socket > option. This way, we can avoid repeated reading of already read bytes of > a received message, hence saving read cycles when forwarding TCP > messages in the host->name space direction. > > In this commit, we add functionality to leverage this feature when > available, while we fall back to the previous behavior when not. > > Measurements with iperf3 shows that throughput increases with 15-20 > percent in the host->namespace direction when this feature is used. > > Signed-off-by: Jon Maloy > > --- > v2: - Some smaller changes as suggested by David Gibson and Stefano Brivio. > - Moved initial set_peek_offset(0) to only the locations where the socket is set > to ESTABLISHED. > - Removed the per-packet synchronization between sk_peek_off and > already_sent. Instead only doing it in retransmit situations. > - The problem I found when trouble shooting the occasionally occurring > out of synch values between 'already_sent' and 'sk_peek_offset' may > have deeper implications that we may need to be investigate. > > v3: - Rebased to most recent version of tcp.c, plus the previous > patch in this series. > - Some changes based on feedback from PASST team > --- > tcp.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++----------- > 1 file changed, 63 insertions(+), 14 deletions(-) > > diff --git a/tcp.c b/tcp.c > index 21cbfba..8297812 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -520,6 +520,9 @@ static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS]; > static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS]; > static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS]; > > +/* Does the kernel support TCP_PEEK_OFF? */ > +static bool peek_offset_cap; > + > /* sendmsg() to socket */ > static struct iovec tcp_iov [UIO_MAXIOV]; > > @@ -535,6 +538,20 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, > int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; > int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; > > +/** > + * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported > + * @s: Socket to update > + * @offset: Offset in bytes > + */ > +static void tcp_set_peek_offset(int s, int offset) > +{ > + if (!peek_offset_cap) > + return; > + > + if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset))) > + err("Failed to set SO_PEEK_OFF"); It would be nice to say on which socket and at which offset, in case it failed. > +} > + > /** > * tcp_conn_epoll_events() - epoll events mask for given connection state > * @events: Current connection events > @@ -1280,11 +1297,14 @@ static void tcp_revert_seq(struct tcp_frame_ref *frame_ref, int first, int last) > continue; > > conn->seq_to_tap = frame_ref[i].seq; > + tcp_set_peek_offset(conn->sock, > + conn->seq_to_tap - conn->seq_ack_from_tap); > > if (SEQ_GE(conn->seq_to_tap, conn->seq_ack_from_tap)) > continue; > > conn->seq_to_tap = conn->seq_ack_from_tap; > + tcp_set_peek_offset(conn->sock, 0); > } > } > > @@ -2203,42 +2223,52 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; > int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; > int sendlen, len, dlen, v4 = CONN_V4(conn); > + uint32_t max_send, seq, already_sent; > int s = conn->sock, i, ret = 0; > struct msghdr mh_sock = { 0 }; > uint16_t mss = MSS_GET(conn); > - uint32_t already_sent, seq; > struct iovec *iov; > > + /* How much have we read/sent since last received ack ? */ > already_sent = conn->seq_to_tap - conn->seq_ack_from_tap; > - Spurious change. > if (SEQ_LT(already_sent, 0)) { > /* RFC 761, section 2.1. */ > flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u", > conn->seq_ack_from_tap, conn->seq_to_tap); > conn->seq_to_tap = conn->seq_ack_from_tap; > already_sent = 0; > + tcp_set_peek_offset(s, 0); > } > > - if (!wnd_scaled || already_sent >= wnd_scaled) { > + /* How much are we still allowed to send within current window ? */ > + max_send = conn->seq_ack_from_tap + wnd_scaled - conn->seq_to_tap; I'm not sure about the purpose of this whole part of the patch, even if I try to see it in the context of 3/3. > + if (SEQ_LE(max_send, 0)) { > + flow_trace(conn, "Empty window: win: %u, sent: %u", > + wnd_scaled, conn->seq_to_tap); > conn_flag(c, conn, STALLED); > conn_flag(c, conn, ACK_FROM_TAP_DUE); > return 0; > } > > - /* Set up buffer descriptors we'll fill completely and partially. */ > - fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss); > + /* Set up buffer descriptors to fill completely or partially. */ Spurious change... or did you really mean to change this comment? > + fill_bufs = DIV_ROUND_UP(max_send, mss); > if (fill_bufs > TCP_FRAMES) { > fill_bufs = TCP_FRAMES; > iov_rem = 0; > } else { > - iov_rem = (wnd_scaled - already_sent) % mss; > + iov_rem = max_send % mss; > } > > - mh_sock.msg_iov = iov_sock; > - mh_sock.msg_iovlen = fill_bufs + 1; > - > - iov_sock[0].iov_base = tcp_buf_discard; > - iov_sock[0].iov_len = already_sent; > + /* Prepare iov according to kernel capability */ > + if (!peek_offset_cap) { > + mh_sock.msg_iov = iov_sock; > + iov_sock[0].iov_base = tcp_buf_discard; > + iov_sock[0].iov_len = already_sent; > + mh_sock.msg_iovlen = fill_bufs + 1; > + } else { > + mh_sock.msg_iov = &iov_sock[1]; > + mh_sock.msg_iovlen = fill_bufs; > + } > > if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) || > (!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) { > @@ -2279,7 +2309,10 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > return 0; > } > > - sendlen = len - already_sent; > + sendlen = len; > + if (!peek_offset_cap) > + sendlen -= already_sent; > + > if (sendlen <= 0) { > conn_flag(c, conn, STALLED); > return 0; > @@ -2449,7 +2482,9 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, > flow_trace(conn, > "fast re-transmit, ACK: %u, previous sequence: %u", > max_ack_seq, conn->seq_to_tap); > + Spurious change. > conn->seq_to_tap = max_ack_seq; > + tcp_set_peek_offset(conn->sock, 0); > tcp_data_from_sock(c, conn); > } > > @@ -2542,6 +2577,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, > conn->seq_ack_to_tap = conn->seq_from_tap; > > conn_event(c, conn, ESTABLISHED); > + tcp_set_peek_offset(conn->sock, 0); > > /* The client might have sent data already, which we didn't > * dequeue waiting for SYN,ACK from tap -- check now. > @@ -2622,6 +2658,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, > goto reset; > > conn_event(c, conn, ESTABLISHED); > + tcp_set_peek_offset(conn->sock, 0); > > if (th->fin) { > conn->seq_from_tap++; > @@ -2788,7 +2825,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, > union sockaddr_inany sa; > socklen_t sl = sizeof(sa); > union flow *flow; > - int s; > + int s = 0; > > if (c->no_tcp || !(flow = flow_alloc())) > return; > @@ -2875,6 +2912,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) > flow_dbg(conn, "ACK timeout, retry"); > conn->retrans++; > conn->seq_to_tap = conn->seq_ack_from_tap; > + tcp_set_peek_offset(conn->sock, 0); > tcp_data_from_sock(c, conn); > tcp_timer_ctl(c, conn); > } > @@ -3166,7 +3204,8 @@ static void tcp_sock_refill_init(const struct ctx *c) > */ > int tcp_init(struct ctx *c) > { > - unsigned b; > + unsigned int b, optv = 0; > + int s; > > for (b = 0; b < TCP_HASH_TABLE_SIZE; b++) > tc_hash[b] = FLOW_SIDX_NONE; > @@ -3190,6 +3229,16 @@ int tcp_init(struct ctx *c) > NS_CALL(tcp_ns_socks_init, c); > } > > + /* Probe for SO_PEEK_OFF support */ > + s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP); > + if (s < 0) { > + warn("Temporary TCP socket creation failed"); > + } else { > + if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int))) > + peek_offset_cap = true; > + close(s); > + } > + info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); Maintaining the extra newline before the return would be nice. The changes actually related to SO_PEEK_OFF look good to me otherwise. > return 0; > } > -- Stefano