From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 15 May 2024 22:22:39 +0200
From: Stefano Brivio
To: Jon Maloy
Subject: Re: [PATCH v4 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available
Message-ID: <20240515222239.550e2adb@elisabeth>
In-Reply-To: <20240515153429.859185-3-jmaloy@redhat.com>
References: <20240515153429.859185-1-jmaloy@redhat.com> <20240515153429.859185-3-jmaloy@redhat.com>
Organization: Red Hat
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
CC: passt-dev@passt.top, lvivier@redhat.com, dgibson@redhat.com
List-Id:
Development discussion and patches for passt

Just two nits, I would be fine applying this as it is:

On Wed, 15 May 2024 11:34:28 -0400
Jon Maloy wrote:

> From linux-6.9.0 the kernel will contain
> commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").
>
> This new feature makes is possible to call recv_msg(MSG_PEEK) and make
> it start reading data from a given offset set by the SO_PEEK_OFF socket
> option. This way, we can avoid repeated reading of already read bytes of
> a received message, hence saving read cycles when forwarding TCP
> messages in the host->name space direction.
>
> In this commit, we add functionality to leverage this feature when
> available, while we fall back to the previous behavior when not.
>
> Measurements with iperf3 shows that throughput increases with 15-20
> percent in the host->namespace direction when this feature is used.
>
> Signed-off-by: Jon Maloy
>
> ---
> v2: - Some smaller changes as suggested by David Gibson and Stefano Brivio.
>     - Moved initial set_peek_offset(0) to only the locations where the socket is set
>       to ESTABLISHED.
>     - Removed the per-packet synchronization between sk_peek_off and
>       already_sent. Instead only doing it in retransmit situations.
>     - The problem I found when trouble shooting the occasionally occurring
>       out of synch values between 'already_sent' and 'sk_peek_offset' may
>       have deeper implications that we may need to be investigate.
>
> v3: - Rebased to most recent version of tcp.c, plus the previous
>       patch in this series.
>     - Some changes based on feedback from PASST team
>
> v4: - Some small changes based on feedback from Stefan/David.
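
As a side note for anybody reading this in the archives: the semantics
described in the commit message can be tried out with a minimal
standalone sketch like the one below. This is not passt code. The
function name peek_at_offset_demo() is made up for illustration, and it
uses an AF_UNIX stream socket pair as a stand-in, on the assumption
that AF_UNIX has supported SO_PEEK_OFF for much longer than TCP, so it
doesn't depend on a 6.9 kernel. It returns -1 if the option isn't
available at all.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustration only (hypothetical helper, not part of passt): queue
 * "helloworld" on one end of an AF_UNIX stream socket pair, set
 * SO_PEEK_OFF to 'off' on the other end, then peek up to 'len' bytes.
 *
 * Return: bytes peeked, or -1 if sockets or SO_PEEK_OFF are unavailable
 */
ssize_t peek_at_offset_demo(char *buf, size_t len, int off)
{
	const char msg[] = "helloworld";
	ssize_t n = -1;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

#ifdef SO_PEEK_OFF
	/* MSG_DONTWAIT: don't block if 'off' is past the queued data */
	if (write(sv[0], msg, sizeof(msg) - 1) == (ssize_t)(sizeof(msg) - 1) &&
	    !setsockopt(sv[1], SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)))
		n = recv(sv[1], buf, len, MSG_PEEK | MSG_DONTWAIT);
#endif

	close(sv[0]);
	close(sv[1]);

	return n;
}
```

With the option available, peeking 5 bytes at offset 5 should skip
"hello" and give back "world", without the discard buffer we would
otherwise need to step over the first five bytes.
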
> ---
>  tcp.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 50 insertions(+), 8 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 976dba8..4163bf9 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -511,6 +511,9 @@ static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
>  static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
>  static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
>
> +/* Does the kernel support TCP_PEEK_OFF? */
> +static bool peek_offset_cap;
> +
>  /* sendmsg() to socket */
>  static struct iovec tcp_iov [UIO_MAXIOV];
>
> @@ -526,6 +529,20 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
>  int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
>  int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
>
> +/**
> + * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported
> + * @s: Socket to update
> + * @offset: Offset in bytes
> + */
> +static void tcp_set_peek_offset(int s, int offset)
> +{
> +	if (!peek_offset_cap)
> +		return;
> +
> +	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
> +		err("Failed to set SO_PEEK_OFF to %u in socket %i", offset, s);

I thought we'd get a format warning if you use %u to print a signed
value, but no, gcc seems to be happy with it.

> +}
> +
>  /**
>   * tcp_conn_epoll_events() - epoll events mask for given connection state
>   * @events: Current connection events
> @@ -2197,14 +2214,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  	uint32_t already_sent, seq;
>  	struct iovec *iov;
>
> +	/* How much have we read/sent since last received ack ? */
>  	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> -

I still maintain that dropping this newline is a spurious change, but
if you really dislike it, I don't have a strong preference to keep it,
either.

>  	if (SEQ_LT(already_sent, 0)) {
>  		/* RFC 761, section 2.1. */
>  		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
>  			   conn->seq_ack_from_tap, conn->seq_to_tap);
>  		conn->seq_to_tap = conn->seq_ack_from_tap;
>  		already_sent = 0;
> +		tcp_set_peek_offset(s, 0);
>  	}
>
>  	if (!wnd_scaled || already_sent >= wnd_scaled) {
> @@ -2222,11 +2240,16 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  		iov_rem = (wnd_scaled - already_sent) % mss;
>  	}
>
> -	mh_sock.msg_iov = iov_sock;
> -	mh_sock.msg_iovlen = fill_bufs + 1;
> -
> -	iov_sock[0].iov_base = tcp_buf_discard;
> -	iov_sock[0].iov_len = already_sent;
> +	/* Prepare iov according to kernel capability */
> +	if (!peek_offset_cap) {
> +		mh_sock.msg_iov = iov_sock;
> +		iov_sock[0].iov_base = tcp_buf_discard;
> +		iov_sock[0].iov_len = already_sent;
> +		mh_sock.msg_iovlen = fill_bufs + 1;
> +	} else {
> +		mh_sock.msg_iov = &iov_sock[1];
> +		mh_sock.msg_iovlen = fill_bufs;
> +	}
>
>  	if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) ||
>  	    (!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) {
> @@ -2267,7 +2290,10 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  		return 0;
>  	}
>
> -	sendlen = len - already_sent;
> +	sendlen = len;
> +	if (!peek_offset_cap)
> +		sendlen -= already_sent;
> +
>  	if (sendlen <= 0) {
>  		conn_flag(c, conn, STALLED);
>  		return 0;
> @@ -2438,6 +2464,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  			   "fast re-transmit, ACK: %u, previous sequence: %u",
>  			   max_ack_seq, conn->seq_to_tap);
>  		conn->seq_to_tap = max_ack_seq;
> +		tcp_set_peek_offset(conn->sock, 0);
>  		tcp_data_from_sock(c, conn);
>  	}
>
> @@ -2530,6 +2557,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
>  	conn->seq_ack_to_tap = conn->seq_from_tap;
>
>  	conn_event(c, conn, ESTABLISHED);
> +	tcp_set_peek_offset(conn->sock, 0);
>
>  	/* The client might have sent data already, which we didn't
>  	 * dequeue waiting for SYN,ACK from tap -- check now.
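
For reference, the accounting in the tcp_data_from_sock() hunks above
boils down to something like the standalone sketch below. It's not the
actual tcp.c code: struct peek_plan, peek_plan_for() and peek_sendlen()
are names made up for illustration. The assumption is simply what the
hunks show: without SO_PEEK_OFF, the first iovec entry swallows
already_sent bytes into the discard buffer and those bytes are
subtracted from what recvmsg() returns. With it, the kernel skips them
for us, so the discard entry is left out and the returned length is
used as it is.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical type, for illustration only: not part of tcp.c */
struct peek_plan {
	size_t first_iov;	/* First iov_sock[] index passed to recvmsg() */
	size_t iovlen;		/* Number of iovec entries passed to recvmsg() */
};

/* Which part of iov_sock[] to hand to recvmsg(MSG_PEEK): entry 0 is
 * the discard buffer, needed only if the kernel can't skip bytes
 */
struct peek_plan peek_plan_for(size_t fill_bufs, bool peek_offset_cap)
{
	struct peek_plan p;

	if (!peek_offset_cap) {
		p.first_iov = 0;
		p.iovlen = fill_bufs + 1;
	} else {
		p.first_iov = 1;
		p.iovlen = fill_bufs;
	}

	return p;
}

/* New payload bytes in a peek that returned 'len' bytes */
ssize_t peek_sendlen(ssize_t len, uint32_t already_sent, bool peek_offset_cap)
{
	ssize_t sendlen = len;

	if (!peek_offset_cap)
		sendlen -= (ssize_t)already_sent;

	return sendlen;
}
```

For example, with 400 bytes already sent and 600 new bytes queued, the
fallback path peeks 1000 bytes and subtracts 400, while the SO_PEEK_OFF
path peeks 600 bytes directly: both end up with a sendlen of 600.
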
> @@ -2610,6 +2638,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
>  			goto reset;
>
>  		conn_event(c, conn, ESTABLISHED);
> +		tcp_set_peek_offset(conn->sock, 0);
>
>  		if (th->fin) {
>  			conn->seq_from_tap++;
> @@ -2863,6 +2892,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
>  			flow_dbg(conn, "ACK timeout, retry");
>  			conn->retrans++;
>  			conn->seq_to_tap = conn->seq_ack_from_tap;
> +			tcp_set_peek_offset(conn->sock, 0);
>  			tcp_data_from_sock(c, conn);
>  			tcp_timer_ctl(c, conn);
>  		}
> @@ -3154,7 +3184,8 @@ static void tcp_sock_refill_init(const struct ctx *c)
>   */
>  int tcp_init(struct ctx *c)
>  {
> -	unsigned b;
> +	unsigned int b, optv = 0;
> +	int s;
>
>  	for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
>  		tc_hash[b] = FLOW_SIDX_NONE;
> @@ -3178,6 +3209,17 @@ int tcp_init(struct ctx *c)
>  		NS_CALL(tcp_ns_socks_init, c);
>  	}
>
> +	/* Probe for SO_PEEK_OFF support */
> +	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
> +	if (s < 0) {
> +		warn("Temporary TCP socket creation failed");
> +	} else {
> +		if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int)))
> +			peek_offset_cap = true;
> +		close(s);
> +	}
> +	info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
> +
>  	return 0;
>  }
>

-- 
Stefano